ABSTRACT

This paper considers a discrete-time system composed of K infinite capacity queues that compete
for the use of a single server. Customers arrive in i.i.d. batches and are served according to a
server allocation policy. Upon completing service, customers either leave the system or are routed
instantaneously to another queue according to some random mechanism. As an alternative to simply
randomized strategies, a policy based on a Stochastic Approximation algorithm is proposed to
drive a long-run average cost to a given value. The motivation can be traced to implementation
issues associated with constrained optimal strategies.

A  version  of  the  ODE  method  as  given  by  Metivier  and  Priouret  is  developed  for  proving  a.s. 
convergence  of  this  algorithm.  This  is  done  by  exploiting  the  recurrence  structure  of  the  system 
under non-idling policies. A probabilistic representation of the solutions to an associated Poisson
equation is found most useful for proving their requisite Lipschitz continuity. The conditions which
guarantee  convergence  are  given  directly  in  terms  of  the  model  data.  The  approach  is  of  independent 
interest,  as  it  is  not  limited  to  this  particular  queueing  application  and  suggests  a  way  of  attacking 
other  similar  problems. 
1.  INTRODUCTION


In  recent  years,  there  has  been  widespread  interest  in  Stochastic  Approximation  algorithms 
as  a  means  to  solve  complex  engineering  problems  [5,15].  As  a  result  of  the  increasing  complexity 
of applications, focus has shifted from the original Robbins-Monro algorithm to more sophisticated
versions, and this has led in particular to the study of projected Stochastic Approximation
algorithms  driven  by  Markovian  “noise”  or  “state”  processes. 

Such algorithms have the following form. Let $\{\eta(n), n = 0, 1, \ldots\}$ be the sequence of iterates,
which take values in a compact convex subset $G$ of $\mathbb{R}^p$, and let $\{X(n), n = 0, 1, \ldots\}$ be the state
process, which takes values in some Borel subset $S$ of $\mathbb{R}^k$. They are related by the recursion

$$\eta(n+1) = \Pi_G\{\eta(n) + a_{n+1} f(\eta(n), X(n+1))\}, \quad n = 0, 1, \ldots \qquad (1.1)$$

with $\eta(0)$ in $G$, where $\Pi_G$ denotes the nearest-point projection on $G$, $f$ is a Borel mapping $G \times S \to \mathbb{R}^p$,
and the step size sequence $\{a_{n+1}, n = 0, 1, \ldots\}$ satisfies the conditions $0 < a_n \downarrow 0$, $\sum_{n=1}^{\infty} a_n = \infty$
and $\sum_{n=1}^{\infty} a_n^2 < \infty$. A complete specification of the algorithms (1.1) requires that the conditional
probability distribution of $X(n+1)$ given $(X(0), \eta(0), X(1), \ldots, X(n), \eta(n))$ be postulated for each
$n = 0, 1, \ldots$. For instance, the Markovian dependencies alluded to earlier require

$$P[X(n+1) \in B \mid X(0), \eta(0), X(1), \ldots, X(n), \eta(n)] = \mu_{\eta(n)}(X(n); B), \quad n = 0, 1, \ldots \qquad (1.2)$$

for every Borel subset $B$ of $S$, where $\{\mu_\eta, \eta \in G\}$ is a family of one-step probability transition
kernels on $S$.

The central question in the theory of Stochastic Approximations is concerned with the convergence
properties of the iterate sequence $\{\eta(n), n = 0, 1, \ldots\}$. For the Robbins-Monro algorithm,
direct martingale arguments have been given by Gladyshev [9] to establish a.s. convergence. However,
in more complex situations such as (1.2), the direct probabilistic approach does not work,
and this failure prompted the development of the so-called ODE method. In all its forms, the
ODE method proceeds in two separate steps. The first step relies on the Kushner-Clark Lemma in
order to identify a deterministic ODE, the stability properties of which determine the limit points
of $\{\eta(n), n = 0, 1, \ldots\}$. The second step is probabilistic in nature and depends on the algorithm
being considered; its purpose is to show that asymptotically (in the mode of convergence of interest)
the output sequence of the original algorithm behaves like the solution to the ODE.

In  their  monograph  [14],  Kushner  and  Clark  have  given  general  conditions  for  successfully 
completing  this  second  step.  In  more  structured  situations  [15],  Kushner  has  shown  how  weak 
convergence methods, through various tightness properties, pave the way to convergence in
probability of the sequence $\{\eta(n), n = 0, 1, \ldots\}$. In the Markovian case, Metivier and Priouret [20]
have established a.s. convergence by making use of properties of the Poisson equation associated
with the transition kernels $\{\mu_\eta, \eta \in G\}$ appearing in (1.2). Key to their analysis are various
properties of Lipschitz continuity (in $\eta$) of the solution to this Poisson equation.

Unfortunately,  in  all  these  references,  the  conditions  underlying  the  second  step  of  the  ODE 
method  are  given  in  implicit  form  and  are  often  hard  to  verify  in  specific  situations.  What  seems 
required is a more operational convergence theory where conditions are given directly in terms of
the model data. For instance, this was done by the authors [16] in the Markovian situation when
the state space $S$ is finite, in which case (1.2) reduces to

$$P[X(n+1) = y \mid X(0), \eta(0), X(1), \ldots, X(n), \eta(n)] = p_{X(n)y}(\eta(n)), \quad n = 0, 1, \ldots \qquad (1.3)$$

for every $y$ in $S$, for some family $\{P(\eta), \eta \in G\}$ of one-step transition probabilities with
$P(\eta) = (p_{xy}(\eta))$. A comprehensive convergence theory was developed under the mild condition of Lipschitz
continuity for the one-step transition probabilities $\eta \to p_{xy}(\eta)$. This was achieved by using a variant
of the approach proposed by Metivier and Priouret, and leads to an a.s. convergence result.

When  the  state  space  S  is  countably  infinite,  the  situation  is  much  more  difficult  and  no  general 
results  seem  available,  which  guarantee  a.s.  convergence  in  terms  of  explicit  conditions  on  the  model 
data.  The  main  technical  difficulty  in  the  approach  of  Metivier  and  Priouret  stems  from  the  fact 
that  several  quantities  of  interest  are  no  longer  bounded  and  that  the  requisite  properties  of  the 
solution  to  the  Poisson  equation  are  now  much  harder  to  obtain.  This  paper  presents  arguments 
for  establishing  these  smoothness  properties  and  the  a.s.  convergence  of  the  algorithm.  Although 
the  discussion  is  carried  out  in  the  context  of  an  adaptive  control  problem  for  a  specific  queueing 
system,  the  approach  is  of  much  wider  applicability  and  should  be  of  use  in  analyzing  a  large  class 
of  Stochastic  Approximations  driven  by  a  Markov  chain  on  a  countable  state  space.  The  approach 
relies  on  the  recurrence  structure  of  the  controlled  system  [18],  and  on  a  probabilistic  interpretation 
of  the  solution  to  the  Poisson  equation  derived  from  it  [25]. 

The  queueing  system  considered  here  is  now  briefly  described;  a  precise  model  formulation  is 
available  in  Section  2:  Consider  a  system  composed  of  K  infinite  capacity  queues  that  compete 
for  the  use  of  a  single  server.  Time  is  slotted  with  the  service  requirement  of  each  customer 
corresponding  exactly  to  one  time  slot.  At  the  beginning  of  each  time  slot,  the  controller  gives 
priority  to  one  of  the  queues  according  to  some  prespecified  dynamic  priority  assignment,  and  the 
selected queue is given service attention during that slot. However, due to a variety of reasons
ranging  from  server  failure  to  exogenous  interferences,  with  a  positive  probability,  the  service  fails, 
in  which  case  the  service  of  that  customer  is  rescheduled  at  a  later  time  in  accordance  with  the 
service  allocation  policy.  When  in  a  given  time  slot  the  service  succeeds,  the  customer  is  either 
declared  serviced  and  leaves  the  system  at  the  end  of  the  slot,  or  is  routed  to  one  of  the  other 
queues  with  a  fixed  probability,  depending  on  both  source  and  destination  queues.  In  the  present 
paper,  the  failures  are  assumed  generated  through  independent  Bernoulli  processes,  with  possibly 
class-dependent  parameters,  and  this  independently  of  the  arrival  mechanism.  New  customers  may 
arrive in batches, which are modelled as an arbitrary K-dimensional renewal process, to capture
partial  correlations  between  arrivals  from  different  classes  in  a  given  slot. 

This queueing system and its variants constitute useful models for studying resource allocation
issues in several application areas, including computer systems and data networks, and as such
they have received a great deal of attention in recent years. Klimov [12] studied a continuous-time
version of this system, and proved that a strict priority policy minimizes the discounted cost
associated  with  a  cost-per-slot  linear  in  the  queue  sizes.  Tsoucas  and  Walrand  [27]  considered  an 
adaptive  version  of  Klimov’s  problem  where  the  service  distributions  are  unknown. 

The  case  where  no  routing  is  allowed  has  been  much  studied.  For  such  systems,  Sidi  and 
Segall  [26]  derived  the  joint  equilibrium  distribution  of  the  queue  size  under  a  fixed  priority  scheme. 
Several authors [3,4,7,10] have shown that the $\mu c$-rule minimizes a variety of performance measures
associated with the aforementioned linear cost structure. In [21], Nain and Ross considered the
situation where several types of traffic, say voice, video and data, compete for the use of a single
synchronous communication channel. They formulated this situation as a system of K discrete-time
queues and found the service allocation strategy minimizing the long-run average of a linear
expression in the queue sizes of $K - 1$ customer classes, under the constraint that the long-run
average queue size of the remaining customer class does not exceed a certain value. Extending
some of the optimality results from Baras, Ma and Makowski [4], they showed that if the constraint
can be met, then the optimal policy $g$ is a Markov stationary policy with the following structure:
There exist two static work-conserving service assignment policies (of which $\mu c$-rules are only one
description), say $\overline{g}$ and $\underline{g}$, and a scalar $\eta^*$ in $(0, 1)$. At the beginning of each time slot, a coin with
bias $\eta^*$ is flipped, and the policy $g$ implements channel rights according to the outcome via $\overline{g}$ and $\underline{g}$
with probability $\eta^*$ and $1 - \eta^*$, respectively. The bias $\eta^*$ is determined so as to meet the constraint.
This result was extended by Altman and Shwartz to the case where the constraint is also given
through a linear combination [1,2].

These  results  are  typical  in  that  analysis  often  identifies  a  policy  g  of  interest  which  is  Markov 
stationary.  Unfortunately,  this  policy  may  not  be  readily  implementable  due  either  to  a  lack  of 
knowledge  of  the  actual  values  of  some  parameters  [13]  or  to  computational  difficulties  inherent  to 
its definition. The situation treated by Nain and Ross is a good case in point, for there non-trivial
off-line computations are required in order to actually compute the value of the bias $\eta^*$, even if all
parameters are known.

This  implementation  issue  provides  the  motivation  for  the  Stochastic  Approximation  studied 
in  this  paper.  In  Section  3,  the  issue  is  discussed  in  the  broader  context  of  “steering  the  cost  to  a 
given  value”,  with  a  view  towards  applications  to  constrained  optimization  [1,2,21].  The  problem 
is now one of finding the bias $\eta^*$ needed in a simple randomization between two policies $\overline{g}$ and
$\underline{g}$ in order to steer a long-run average cost to a given value. The resulting randomized Markov
stationary policy, denoted $g$ hereafter, can be implemented by means of a projected Stochastic
Approximation. This algorithm computes on-line estimates of $\eta^*$ which are then used in a Certainty
Equivalence controller $\alpha$ based on the special form of $g$. Theorems 3.1 and 3.2 contain the main
results concerning the performance of this policy $\alpha$, namely that the policies $\alpha$ and $g$ yield the same
value for the long-run average cost, and that the iterates $\{\eta(n), n = 0, 1, \ldots\}$ converge a.s. under $\alpha$
to the bias value $\eta^*$. This improves on earlier results of the authors [23] for the same algorithm in
the context of the system with two queues with no routing. There, only convergence in probability
was established, albeit under weaker conditions on moments.

The  convergence  proof  for  the  Stochastic  Approximation  algorithm  hinges  on  the  availability  of 
bounds  on  moments  of  the  queue  size  process  which  are  uniform  in  the  policy,  and  on  the  smoothness 
properties  of  solutions  to  an  associated  Poisson  equation  [20,  25].  The  bounds  are  obtained  in 
Section  4  by  means  of  renewal  arguments  which  relate  the  queue  size  to  the  recurrence  times  to 
the  empty  state.  In  Section  5,  novel  arguments  are  developed  for  proving  the  Lipschitz  continuity 
for  solutions  to  the  Poisson  equation  and  for  establishing  bounds  on  them.  It  is  appropriate  to 
stress  the  methodological  value  of  both  Sections  4  and  5,  in  that  ideas  therein  are  by  no  means 
restricted  to  the  competing  queue  model  and  can  be  used  mutatis  mutandis  in  many  other  situations. 
However,  the  approach  was  developed  here  for  a  Stochastic  Approximation  algorithm  for  a  specific 
model,  rather  than  for  general  Markov  chains  with  countable  state  spaces,  in  order  to  present  the 
arguments more clearly, unencumbered by technical details which often accompany more formal
treatments.

The a.s. convergence of the Stochastic Approximation scheme defining the implementation $\alpha$
is established in Section 6, where the various estimates of the previous sections allow for a rather
simple proof. Finally, the cost properties of the policy $\alpha$ are discussed in Section 7 by making
use of the convergence of the Stochastic Approximation and invoking the results on the “Certainty
Equivalence” Principle developed in [24]; the requisite hypotheses of [24] are easily verified for
this system with the help of bounds on solutions to the Poisson equation. The paper concludes
with an application to the constrained optimization problem discussed by Nain and Ross in [21].
All necessary conditions are verified and the policy $\alpha$ thus constitutes an implementation of the
Markov stationary policy which is constrained optimal for this problem.

A few words on the notation and conventions used throughout the paper: The set of all non-negative
integers is denoted by $\mathbb{N}$, and $\mathbb{R}$ (resp. $\mathbb{R}_+$) stands for the set of all real (resp. positive
real) numbers. Elements of $\mathbb{R}^K$ are always interpreted as $K \times 1$ column vectors, and the $k$th
component of any element $x$ of $\mathbb{R}^K$ is denoted by $x_k$, $1 \le k \le K$, with a similar convention for
random variables (RVs). Thus an element $x$ of $\mathbb{R}^K$ can also be written as $(x_1, \ldots, x_K)'$ (with $'$
denoting transpose), and its norm is given by $|x| := \sum_{k=1}^{K} |x_k|$. The elements $e$ and $0$ of $\mathbb{R}^K$ are
defined as the vectors $e = (1, \ldots, 1)'$ and $0 = (0, \ldots, 0)'$ with identical components. The standard
basis $\{e^1, \ldots, e^K\}$ for $\mathbb{R}^K$ is denoted by $B_K$, while $S_K$ is the standard simplex defined by

$$S_K := \Big\{p \in \mathbb{R}^K : \sum_{k=1}^{K} p_k = 1 \ \text{and} \ 0 \le p_k \le 1, \ 1 \le k \le K\Big\}. \qquad (1.4)$$

The indicator function of a set $A$ is denoted by $I[A]$. Unless stated otherwise, the notations $\underline{\lim}_n$
and $\overline{\lim}_n$ are understood with $n$ going to infinity.

2.  MODEL  AND  ASSUMPTIONS 
2.1  The  basic  random  variables 

In this paper, all probabilistic elements are defined on a single sample space $\Omega$ equipped
with the $\sigma$-field of events $\mathcal{F}$. This sample space carries the basic RVs $\Xi$, $\{U(n), n = 0, 1, \ldots\}$,
$\{A(n), n = 0, 1, \ldots\}$, $\{B(n), n = 0, 1, \ldots\}$ and $\{R(n), n = 0, 1, \ldots\}$, which take values in $\mathbb{N}^K$,
$B_K$, $\mathbb{N}^K$, $\{0, 1\}^K$ and $\{0, 1, \ldots, K\}^K$, respectively. The information RVs $\{H(n), n = 0, 1, \ldots\}$ are
recursively defined by $H(0) := \Xi$ and

$$H(n+1) := (H(n), U(n), A(n), B(n), R(n)), \quad n = 0, 1, \ldots; \qquad (2.1)$$

they take values in the information spaces $\{\mathbb{H}_n, n = 0, 1, \ldots\}$, where $\mathbb{H}_0 := \mathbb{N}^K$ and
$\mathbb{H}_{n+1} := \mathbb{H}_n \times B_K \times \mathbb{N}^K \times \{0, 1\}^K \times \{0, 1, \ldots, K\}^K$ for all $n = 0, 1, \ldots$.

These quantities have a ready interpretation in the context of the queueing system described
in the introduction: The number of customers initially in the $k$th queue is set at $\Xi_k$, and for each
$n = 0, 1, \ldots$, the state of the system is represented by a RV $X(n)$ with integer components, with the
interpretation that at the beginning of the slot $[n, n+1)$, $X_k(n)$ customers are present in the $k$th
queue, including the one receiving service. The following chain of events occurs:

(i): The control action $U(n)$ is selected with the convention that $U_k(n) = 1$ (resp. $U_k(n) = 0$)
if the $k$th queue is (resp. is not) given service attention during that slot. The fact that
$U(n)$ takes values in $B_K$ guarantees that exactly one queue is given service attention;

(ii): New customers arrive into the system according to the RV $A(n)$, with $A_k(n)$ new customers
joining the $k$th queue;

(iii): A completion of service possibly occurs at the queue that was given service attention during
the slot. This is encoded in the binary RV $B(n)$, where $B_k(n) = 1$ (resp. $B_k(n) = 0$)
signifies successful completion (resp. abortion) of service for the $k$th queue conditioned on
it being given service attention and non-empty; and

(iv): If a service completion occurs at the queue that was given service attention during the
slot, then instantaneously the serviced customer is either transferred to another queue or
leaves the network. This routing decision is implemented through the variable $R(n)$
with the following interpretation. If the service completion occurred at the $k$th queue,
then $R_k(n) = \ell$, $1 \le \ell \le K$, means that the serviced customer joins the $\ell$th queue, while
$R_k(n) = 0$ expresses the fact that this customer leaves the system.

As a result, the successive system states or queue size vectors form a sequence $\{X(n), n = 0, 1, \ldots\}$
of $\mathbb{N}^K$-valued RVs which are generated componentwise through the recursion

$$X_k(n+1) = X_k(n) + A_k(n) - I[X_k(n) \ne 0]\, U_k(n) B_k(n) + \sum_{\ell=1}^{K} I[X_\ell(n) \ne 0]\, U_\ell(n) B_\ell(n)\, I[R_\ell(n) = k], \quad 1 \le k \le K, \ n = 0, 1, \ldots \qquad (2.2)$$

with $X(0) := \Xi$.
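The recursion (2.2) can be illustrated by a minimal Python sketch of a single time slot. The function name `step`, the 0-based index `u` for the queue attended (standing in for the basis vector U(n)), and the list encoding of the vectors are assumptions made for this illustration only; they are not part of the model formulation.

```python
def step(x, u, a, b, r):
    """One slot of recursion (2.2) (illustrative encoding, not the paper's).

    x: current queue sizes, a list of K non-negative integers
    u: 0-based index of the single queue given service attention
    a: batch arrivals per queue during the slot
    b: 0/1 service-success flags, one per queue
    r: routing targets per queue; 0 means "leave", l in 1..K means "join queue l"
    """
    K = len(x)
    nxt = [x[k] + a[k] for k in range(K)]   # arrivals join their queues
    if x[u] != 0 and b[u] == 1:             # completion needs a non-empty attended queue
        nxt[u] -= 1                         # the serviced customer departs queue u
        if r[u] != 0:                       # ... and is routed instantaneously
            nxt[r[u] - 1] += 1
    return nxt
```

Because exactly one queue is attended in each slot, at most one term of the sum in (2.2) is non-zero, which is why a single index `u` suffices here.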

At the beginning of each time slot $[n, n+1)$, the channel controller has access to the initial
queue sizes $\Xi$, the past arrival pattern $A(i)$, $0 \le i < n$, past decisions $U(i)$, $0 \le i < n$, past service
completions $B(i)$, $0 \le i < n$, and past routing decisions $R(i)$, $0 \le i < n$. Thus, the decision-maker
has  knowledge  of  the  RV  H(n)  which  is  used  to  generate  the  control  value  U(n)  implemented  in  the 
slot  [n,n  +  1).  The  selection  of  this  control  value  is  done  according  to  a  prespecified  mechanism, 
which  may  be  either  deterministic  or  random. 

2.2  The  probabilistic  structure 

Since randomized strategies are allowed, an admissible control policy $\pi$ is defined as any collection
$\{\pi_n, n = 0, 1, \ldots\}$ of mappings $\pi_n : \mathbb{H}_n \to S_K$, with the interpretation that at times
$n = 0, 1, \ldots$, the $k$th queue is given service attention with probability $\pi_n(k; h_n)$ whenever the information
vector $h_n$ in $\mathbb{H}_n$ is available to the system controller. Denote the collection of all such
admissible policies by $\mathcal{P}$.

Let $q_\Xi(\cdot)$ and $q(\cdot)$ be two probability mass distributions on $\mathbb{N}^K$, and fix a service rate vector
$\mu$ in $(0, 1]^K$. Moreover, let $P$ denote a $K \times K$ substochastic matrix $(p_{k\ell}, 1 \le k, \ell \le K)$, i.e.,

$$0 \le p_{k\ell} \le 1 \quad \text{and} \quad \sum_{\ell=1}^{K} p_{k\ell} \le 1, \quad 1 \le k, \ell \le K, \qquad (2.3)$$

and set

$$p_{k0} := 1 - \sum_{\ell=1}^{K} p_{k\ell}, \quad 1 \le k \le K. \qquad (2.4)$$

Throughout the discussion, the non-degeneracy condition

$$0 < q(0) < 1 \qquad (2.5a)$$

and the finite mean condition

$$\sum_{a \in \mathbb{N}^K} |a|\, q(a) < \infty \qquad (2.5b)$$

are enforced, and it is always assumed that the matrix $I - P$ is invertible, a condition which
is equivalent to the system being open, i.e., every customer eventually leaves the system with
probability one.

The model is now completely specified by postulating the existence of a family $\{P^\pi, \pi \in \mathcal{P}\}$
of probability measures on the $\sigma$-field $\mathcal{F}$ which satisfies the requirements (R1)-(R3) below, i.e., for
every policy $\pi$ in $\mathcal{P}$,

(R1): For all $x$ in $\mathbb{N}^K$,

$$P^\pi[\Xi = x] := q_\Xi(x);$$

(R2): For all $a$ in $\mathbb{N}^K$, $b$ in $\{0, 1\}^K$ and $r$ in $\{0, 1, \ldots, K\}^K$,

$$\begin{aligned} P^\pi[A(n) = a, B(n) = b, R(n) = r \mid \mathcal{F}_n \vee \sigma\{U(n)\}] &:= P^\pi[A(n) = a]\, P^\pi[B(n) = b]\, P^\pi[R(n) = r] \\ &= q(a) \cdot \prod_{k=1}^{K} \big(b_k \mu_k + (1 - b_k)(1 - \mu_k)\big) \cdot \prod_{k=1}^{K} p_{k r_k}, \quad n = 0, 1, \ldots \end{aligned}$$

where for each $n = 0, 1, \ldots$, $\mathcal{F}_n$ denotes the $\sigma$-field on the sample space $\Omega$ generated by
the RV $H(n)$; and

(R3): For all $e^k$ in $B_K$, $1 \le k \le K$,

$$P^\pi[U(n) = e^k \mid \mathcal{F}_n] := \pi_n(k; H(n)), \quad n = 0, 1, \ldots$$

The existence of a sample space $(\Omega, \mathcal{F})$ that carries such a family of probability measures
$\{P^\pi, \pi \in \mathcal{P}\}$ is easily established via the Kolmogorov Extension Theorem, by taking $\Omega$ to be the
canonical space $\mathbb{N}^K \times (B_K \times \mathbb{N}^K \times \{0, 1\}^K \times \{0, 1, \ldots, K\}^K)^\infty$ equipped with its natural $\sigma$-field.
This modeling approach was adopted in [23] for a special case of the Markov decision process under
consideration; the reader is referred there for additional information.

The reader will readily check that under each probability measure $P^\pi$, the following properties
(P1)-(P4) hold true, where

(P1): The $\mathbb{N}^K$-valued RV $\Xi$ and the sequences of RVs $\{A(n), n = 0, 1, \ldots\}$, $\{B(n), n = 0, 1, \ldots\}$
and $\{R(n), n = 0, 1, \ldots\}$ are mutually independent;

(P2): The sequences $\{B_k(n), n = 0, 1, \ldots\}$ of $\{0, 1\}$-valued RVs are mutually independent i.i.d.
Bernoulli sequences with parameters $\mu_k$, $1 \le k \le K$;

(P3): The sequences $\{R_k(n), n = 0, 1, \ldots\}$ of $\{0, 1, \ldots, K\}$-valued RVs are mutually independent i.i.d.
sequences with

$$P^\pi[R_k(n) = \ell] = p_{k\ell}, \quad 1 \le k, \ell \le K, \ n = 0, 1, \ldots; \qquad (2.6)$$

(P4): The $\mathbb{N}^K$-valued RVs $\{A(n), n = 0, 1, \ldots\}$ form a sequence of i.i.d. RVs with common
probability distribution $q(\cdot)$.

For $1 \le k \le K$, denote by $\lambda_k$ the first moment of the sequence $\{A_k(n), n = 0, 1, \ldots\}$ and set
$\nu_k = \mu_k^{-1}$. For future use, define the network traffic coefficient $\rho$ by

$$\rho := \lambda'(I - P)^{-1}\nu \qquad (2.7)$$

where $\lambda = (\lambda_1, \ldots, \lambda_K)'$ and $\nu = (\nu_1, \ldots, \nu_K)'$.
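As a rough numerical illustration of (2.7), the following Python sketch computes the traffic coefficient for given mean arrival rates, routing matrix and service rates. The names `solve`, `traffic_coefficient`, `lam` and `mu` are assumptions introduced for this example, and the pure-Python elimination routine is merely one convenient way of applying the inverse of I - P. One way to read the formula, under the stated model, is that the k-th entry of the solved vector is the mean number of service slots a customer entering queue k requires before leaving the system.

```python
def solve(M, rhs):
    """Solve M y = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [[float(v) for v in M[i]] + [float(rhs[i])] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(aug[i][col]))
        if abs(aug[piv][col]) < 1e-12:
            raise ValueError("I - P is singular: the network is not open")
        aug[col], aug[piv] = aug[piv], aug[col]
        for i in range(n):
            if i != col:
                factor = aug[i][col] / aug[col][col]
                aug[i] = [ai - factor * ac for ai, ac in zip(aug[i], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def traffic_coefficient(lam, P, mu):
    """Traffic coefficient of (2.7): lam' (I - P)^{-1} nu with nu_k = 1 / mu_k."""
    K = len(lam)
    i_minus_p = [[(1.0 if i == j else 0.0) - P[i][j] for j in range(K)]
                 for i in range(K)]
    nu = [1.0 / m for m in mu]
    y = solve(i_minus_p, nu)          # y = (I - P)^{-1} nu
    return sum(l * yk for l, yk in zip(lam, y))
```

With no routing (P identically zero) the expression reduces to the sum of the ratios of arrival rate to service rate over the queues.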

2.3  Several  families  of  policies 

A policy $\pi$ in $\mathcal{P}$ is said to be a Markov policy if there exists a family $\{g_n, n = 0, 1, \ldots\}$
of mappings $g_n : \mathbb{N}^K \to S_K$ such that $\pi_n(H(n)) = g_n(X(n))$ $P^\pi$-a.s. for all $n = 0, 1, \ldots$, with
$\{X(n), n = 0, 1, \ldots\}$ generated through the recursion (2.2). When the mappings $\{g_n, n = 0, 1, \ldots\}$
are all identical to some mapping $g : \mathbb{N}^K \to S_K$, the Markov policy $\pi$ is termed stationary and can
be identified with the mapping $g$ itself, as will be done repeatedly in the sequel.

A policy $\pi$ in $\mathcal{P}$ is said to be non-idling or work-conserving whenever for all $1 \le k \le K$, the
condition

$$[\pi_n(k; H(n)) > 0, \ X(n) \ne 0] = [\pi_n(k; H(n)) > 0, \ X_k(n) \ne 0], \quad n = 0, 1, \ldots \qquad (2.8)$$

holds true $P^\pi$-a.s.
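For a stationary policy, condition (2.8) can be checked state by state: whenever the system is non-empty, every queue that receives service attention with positive probability must itself be non-empty. The Python sketch below spells this out; the two toy policies `always_first` and `serve_longest` and the helper `is_non_idling_at` are hypothetical names used only for illustration.

```python
def always_first(x):
    """Toy stationary policy (hypothetical): always attend the first queue."""
    return [1.0] + [0.0] * (len(x) - 1)


def serve_longest(x):
    """Toy stationary policy (hypothetical): attend a longest queue."""
    k = max(range(len(x)), key=lambda i: x[i])
    return [1.0 if i == k else 0.0 for i in range(len(x))]


def is_non_idling_at(g, x):
    """Check the non-idling condition (2.8) for stationary policy g at state x."""
    if all(xk == 0 for xk in x):
        return True                      # (2.8) only constrains non-empty states
    probs = g(x)
    # every queue attended with positive probability must hold a customer
    return all(x[k] != 0 for k in range(len(x)) if probs[k] > 0)
```

In this sketch `always_first` fails the check at any state where the first queue is empty and another queue is not, whereas `serve_longest` passes at every state.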

3.  PROBLEM  FORMULATION 

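Let $c$ denote a mapping $\mathbb{N}^K \to \mathbb{R}$. For any admissible policy $\pi$ in $\mathcal{P}$, set

$$J(\pi) := \overline{\lim}_t \, E^\pi\Big[\frac{1}{t+1} \sum_{s=0}^{t} c(X(s))\Big] \qquad (3.1)$$

(whenever meaningful), with the usual interpretation that $J(\pi)$ is a measure of system performance
when using the policy $\pi$.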

3.1  Steering  the  cost 

Consider the problem of steering the cost (3.1) to a given value, i.e., finding a Markov stationary
policy $g$ such that $J(g) = V$ for some constant $V$ determined possibly through design considerations.
As demonstrated by Ross [22], versions of this problem naturally arise in solving constrained MDPs
by Lagrangian arguments. Here the discussion is given under the assumption that there exist two
Markov (possibly randomized) stationary policies $\underline{g}$ and $\overline{g}$ such that

$$J(\underline{g}) < V < J(\overline{g}). \qquad (3.2)$$

For every $\eta$ in the unit interval $[0, 1]$, the policy $f^\eta$, obtained by simply randomizing between the
two policies $\overline{g}$ and $\underline{g}$ with bias $\eta$, is the Markov stationary policy determined through the mapping
$f^\eta : \mathbb{N}^K \to S_K$, where

$$f^\eta(k; x) := \eta\, \overline{g}(k; x) + (1 - \eta)\, \underline{g}(k; x), \quad 1 \le k \le K \qquad (3.3)$$

for all $x$ in $\mathbb{N}^K$. Note that for $\eta = 1$ (resp. $\eta = 0$) the randomized policy $f^\eta$ coincides with $\overline{g}$ (resp.
$\underline{g}$). Owing to (3.2), if the mapping $\eta \to J(f^\eta)$ is continuous on the interval $[0, 1]$, then at least one
randomized strategy $f^{\eta^*}$ meets the value $V$ and its corresponding bias value $\eta^*$ is a solution of the
equation

$$J(f^\eta) = V, \quad \eta \ \text{in} \ [0, 1], \qquad (3.4)$$

so that the identification $g = f^{\eta^*}$ may take place.
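The randomization (3.3) amounts to a pointwise convex combination of two service-attention distributions. A minimal Python sketch follows, assuming a stationary policy is encoded as a function from a state (a list of queue sizes) to a list of K probabilities; the names `priority` and `mix`, and the use of static priority rules as the two policies being mixed, are illustrative choices rather than the paper's construction.

```python
def priority(order):
    """Static non-idling priority policy: attend the first non-empty queue in `order`."""
    def g(x):
        probs = [0.0] * len(x)
        for k in order:
            if x[k] > 0:
                probs[k] = 1.0
                return probs
        probs[order[0]] = 1.0            # empty system: the choice is immaterial
        return probs
    return g


def mix(g_first, g_second, eta):
    """Randomized policy in the spirit of (3.3): eta * g_first + (1 - eta) * g_second."""
    def f(x):
        return [eta * p + (1.0 - eta) * q
                for p, q in zip(g_first(x), g_second(x))]
    return f
```

For instance, `mix(priority([0, 1]), priority([1, 0]), 0.3)` attends the first of two non-empty queues with probability 0.3 and the second with probability 0.7; when only one queue is non-empty, both priority rules agree and the mixture attends that queue with probability one, so the mixture inherits the non-idling property.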

3.2  Implementation  issues 

Solving the (highly) nonlinear equation (3.4) for the bias value $\eta^*$ is usually a non-trivial
task, even in the simplest of situations [17, 21]. This difficulty may be circumvented by proposing
alternatives to the policy $g$ which bypass a direct solution of the equation (3.4). One possible
approach is to design (simple recursive) schemes for estimating the value $\eta^*$ which solves (3.4)
and then to define a so-called “naive feedback” policy $\alpha = \{\alpha_n, n = 0, 1, \ldots\}$ via the Certainty
Equivalence Principle. Such a policy $\alpha$ can be written in the form

$$\alpha_n := \eta(n)\, \overline{g} + (1 - \eta(n))\, \underline{g}, \quad n = 0, 1, \ldots \qquad (3.5)$$

for some sequence of $[0, 1]$-valued RVs $\{\eta(n), n = 0, 1, \ldots\}$ which act as “estimates” for the bias value
$\eta^*$. It is hoped that the effects of controlling and learning about the system will combine to produce
a consistent estimation scheme. In such a case, the sequence of estimates $\{\eta(n), n = 0, 1, \ldots\}$
converges to the value $\eta^*$ in some sense, thus providing increasingly better approximations to the
appropriate bias value. This policy $\alpha$ will constitute an acceptable implementation of $g$ provided
$J(\alpha) = J(g)$.

At this point, the reader may wonder as to how such an estimation scheme can be selected.
If the function $\eta \to J(f^\eta)$ were continuous and strictly monotone, say increasing for the sake of
definiteness, then the search for $\eta^*$ could be interpreted as finding the zero of the continuous, strictly
monotone function $\eta \to J(f^\eta) - V$, and this brings to mind ideas from the theory of Stochastic
Approximations [14]. Here, the Robbins-Monro version of these algorithms suggests that a sequence
of bias values $\{\eta(n), n = 0, 1, \ldots\}$ be generated through the recursion

$$\eta(n+1) = \big[\eta(n) + a_{n+1}\big(V - c(X(n+1))\big)\big]_0^1, \quad n = 0, 1, \ldots \qquad (3.6)$$

with $\eta(0)$ given in $[0, 1]$. In (3.6) the notation $[x]_0^1 = 0 \vee (x \wedge 1)$ is used for every $x$ in $\mathbb{R}$, and the
sequence of step sizes $\{a_n, n = 1, 2, \ldots\}$ satisfies the conditions

$$0 < a_n \downarrow 0, \quad \sum_{n=1}^{\infty} a_n = \infty \quad \text{and} \quad \sum_{n=1}^{\infty} a_n^2 < \infty. \qquad (3.7)$$

If the mapping $\eta \to J(f^\eta)$ were monotone decreasing, then the Stochastic Approximation
algorithm (3.6) would be modified by replacing $V - c(X(n+1))$ with $c(X(n+1)) - V$.
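The mechanics of the projected recursion (3.6) can be seen in a small Python sketch. The step size 1/(n+1), which is one sequence satisfying (3.7), and the toy cost model in `run_toy` are assumptions made for this illustration: the observed cost is taken to be twice the current bias plus i.i.d. noise, so that the averaged cost is increasing in the bias and hits the target 1.0 at bias 0.5. The queueing dynamics and the Markovian dependence treated in the paper are not simulated here.

```python
import random


def project_unit_interval(x):
    """Truncation to [0, 1], i.e. 0 v (x ^ 1) as used in (3.6)."""
    return max(0.0, min(1.0, x))


def sa_update(eta, n, observed_cost, target):
    """One step of (3.6) with the illustrative step size a_{n+1} = 1 / (n + 1)."""
    step_size = 1.0 / (n + 1)
    return project_unit_interval(eta + step_size * (target - observed_cost))


def run_toy(num_steps=20000, target=1.0, seed=0):
    """Toy driver (assumption): cost = 2 * eta + uniform noise on [-1, 1]."""
    rng = random.Random(seed)
    eta = 0.0
    for n in range(num_steps):
        observed_cost = 2.0 * eta + rng.uniform(-1.0, 1.0)
        eta = sa_update(eta, n, observed_cost, target)
    return eta
```

In this toy setting the iterates settle near 0.5. For a decreasing cost map, the sign of the correction term would be flipped, as noted above.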

3.3  The  results 

This paper is devoted to analyzing the behavior of the system under the adaptive policy $\alpha$
defined through (3.5) and (3.6), and the main results in this direction are now described. To do
this requires the additional assumptions (R4)-(R6) on the data of the problem, where

(R4): There exists some integer $\gamma > 1$ such that for every policy $\pi$ in $\mathcal{P}$, the moment conditions

$$E^\pi[|\Xi|^\gamma] = \sum_{x \in \mathbb{N}^K} |x|^\gamma q_\Xi(x) < \infty$$

and

$$E^\pi[|A(n)|^\gamma] = \sum_{a \in \mathbb{N}^K} |a|^\gamma q(a) < \infty, \quad n = 0, 1, \ldots$$

hold true;

(R5): There exist an integer $\delta > 0$ and a constant $L > 0$ in $\mathbb{R}$ such that

$$|c(x)| \le L(1 + |x|^\delta) =: c(|x|)$$

for all $x$ in $\mathbb{N}^K$; and

(R6): The policies $\overline{g}$ and $\underline{g}$ are non-idling Markov stationary policies such that (3.2) holds.

The  results  concerning  the  policy  a  are  now  summarized. 

Theorem 3.1. Assume (R1)-(R6) to hold with $\rho < 1$ and let the integer exponent $\gamma$ in (R4) satisfy
the condition

$$2\delta + 3 < \gamma. \qquad (3.8)$$

If the mapping $\eta \to J(f^\eta)$ is strictly monotone, then

$$\lim_n \eta(n) = \eta^* \quad P^\alpha\text{-a.s.} \qquad (3.9)$$
Under these conditions, the system also satisfies a Certainty Equivalence Principle [24]. A formal version of this property takes the following form.

Theorem 3.2. Assume (R1)-(R6) to hold with $\rho < 1$ and let the integer exponent $\gamma$ in (R4) satisfy the condition

$$\max\{3,\ 1 + \delta(1 + \varepsilon)\} < \gamma \tag{3.10}$$

for some $\varepsilon > 0$. If $\lim_n \eta(n) = \eta^*$ in probability under $P^\alpha$, then the convergence

$$J(\alpha) = \lim_t \frac{1}{t+1} \sum_{s=0}^{t} c(X(s)) = J(g) \tag{3.11}$$

takes place in $L^1(\Omega, \mathcal{F}, P^\alpha)$, so that

$$J(\alpha) = \lim_t E^\alpha\Big[\frac{1}{t+1} \sum_{s=0}^{t} c(X(s))\Big] = J(g). \tag{3.12}$$

Moreover, for any other mapping $d : \mathbb{N}^K \to \mathbb{R}$, if there exist an integer $\delta' > 0$ and a constant $L' > 0$ such that

$$|d(x)| \le L'(1 + |x|^{\delta'}) \tag{3.13}$$

for all $x$ in $\mathbb{N}^K$, then both (3.11) and (3.12) hold for the long-run average cost (3.1) associated with $d$ provided the condition

$$\max\{3,\ 1 + \delta'(1 + \varepsilon')\} < \gamma \tag{3.14}$$

holds for some $\varepsilon' > 0$.

The restriction that $\delta$ and $\delta'$ be integers is not essential but results in some simplifications in the notation. An example where the hypotheses of Theorems 3.1-3.2 hold is given in Section 7.

This section closes with a few facts which are easily derived from the enforced assumptions: Under (R6), the policies $f^\eta$, $0 \le \eta \le 1$, and $\alpha$ are all non-idling since $\overline{g}$ and $\underline{g}$ are non-idling. Moreover, note from (2.2) that

$$X_k(n+1) \le X_k(n) + A_k(n) + 1, \quad 1 \le k \le K, \quad n = 0, 1, \ldots \tag{3.15}$$

It then follows from (R4) that $E^\pi[|X(n)|^\gamma] < \infty$ for all $n = 0, 1, \ldots$ under any policy $\pi$ in $\mathcal{P}$. Since $\delta < \gamma$ under either (3.8) or (3.10), it is then immediate from (R5) that

$$E^\pi[|c(X(n))|] \le L(1 + E^\pi[|X(n)|^\delta]) < \infty \tag{3.16}$$

and therefore $J(\pi)$ is always well defined and finite. A similar argument shows that under the conditions (3.13) and (3.14), the long-run average cost associated with $d$ is also well defined under any policy $\pi$ in $\mathcal{P}$.

4.  MOMENT  ESTIMATES 
4.1.  The  bounds 

The proofs of Theorems 3.1 and 3.2 require bounds on the moments of the RVs $\{|X(n)|,\ n = 0, 1, \ldots\}$ which are uniform over the class of all non-idling policies $\pi$ in $\mathcal{P}$. The derivation of such bounds is given below, and is based on the key observation that the total number of customers in the system at any given time $t$ decreases by at most one unit in the next time slot $[t, t+1)$, and is therefore bounded above by the number of slots it takes for the system to empty for the first time after $t$. This simple fact can be used to advantage when combined with the detailed statistical information obtained by the authors on the time until the system empties [18], and leads to the following strong estimates.

Theorem 4.1. Assume (R1)-(R5) with $\rho < 1$. There exists a single positive constant $K_\gamma$ such that for every non-idling policy $\pi$ in $\mathcal{P}$, the moment estimate

$$\sup_n E^\pi[|X(n)|^{\gamma-1}] \le K_\gamma < \infty \tag{4.1}$$

holds true.

Theorem 4.1, the proof of which is presented below, turns out to be a special case of an intermediate result of independent interest given in Theorem 4.4. Before discussing this more general result, it is convenient to notice the following simple and useful consequence of (4.1).

Corollary 4.2. Assume (R1)-(R5) with $\rho < 1$. Whenever $\gamma > 2$, the RVs $\{|X(n)|,\ n = 0, 1, \ldots\}$ are uniformly integrable under the probability measure $P^\pi$ associated with any non-idling policy $\pi$ in $\mathcal{P}$.

4.2.  Recurrence  properties 

To formalize the argument outlined earlier, it is necessary to study the recurrence structure of the process $\{X(n),\ n = 0, 1, \ldots\}$ under any non-idling policy $\pi$ in $\mathcal{P}$. To that end consider the RVs $\{\tau_k,\ k = 0, 1, 2, \ldots\}$ and $\{\sigma_k,\ k = 1, 2, \ldots\}$ defined recursively by $\tau_0 = \sigma_1 := 0$, and

$$\tau_{k+1} := \inf\{n > \sigma_{k+1} : X(n) = 0\}, \quad k = 0, 1, \ldots \tag{4.2a}$$

and

$$\sigma_{k+1} := \inf\{n > \tau_k : X(n) \ne 0\}, \quad k = 1, 2, \ldots \tag{4.2b}$$

with the convention that $\tau_{k+1} = \infty$ (resp. $\sigma_{k+1} = \infty$) whenever the defining set is empty or when $\sigma_{k+1} = \infty$ (resp. $\tau_k = \infty$). Note that these definitions are different from those given in [18], for there $\nu(0)$ denotes the present RV $\tau_1$. For $k = 2, 3, \ldots$ the RV $\tau_k$ is the time epoch at which the system empties itself for the $(k-1)$st time after $\tau_1$, so that $\sigma_{k+1}$ is the time epoch when the system becomes non-empty again for the first time after $\tau_k$. Moreover, define the RVs $\{\theta_k,\ k = 1, 2, \ldots\}$ by

$$\theta_{k+1} = \tau_{k+1} - \tau_k, \quad k = 0, 1, \ldots \tag{4.3}$$

so that $\theta_1 = \tau_1$. The following results were obtained in Sections 4-5 of [18].

Proposition 4.3. Assume (R1)-(R5) with $\rho < 1$. Under any non-idling policy $\pi$ in $\mathcal{P}$, the RVs $\{\theta_k,\ k = 1, 2, \ldots\}$ form a delayed renewal process, the statistics of which are independent of the policy $\pi$, with finite means given by


E*[0k\X(O)  =  x]=: 


l  l 

.  1—9(0)  i-p 


if  fc  =  1 

if  k  =  2, 3, . . . 


(4.4) 


Moreover, the RV $\theta_2$ possesses finite moments of order $\gamma$, and for every integer $\ell$, $1 \le \ell \le \gamma$, there exists a positive constant $C_\ell$ (independent of the policy $\pi$) such that

$$E^\pi[\tau_1^\ell \mid X(0) = x] \le C_\ell(1 + |x|^\ell) \tag{4.5}$$

for all $x$ in $\mathbb{N}^K$.

In view of this result, it is natural to introduce $\mathcal{E}_x$ as the expectation operator with respect to the distribution of $\tau_1$ given that $X(0) = x$ and that any non-idling policy is used. Finally, for reference, denote by $G(\cdot)$ the distribution of the RV $\theta_1\ (= \tau_1)$ and by $F(\cdot)$ the common distribution of the i.i.d. RVs $\{\theta_k,\ k = 2, 3, \ldots\}$. By definition, the distributions $G(\cdot)$ and $F(\cdot)$ do not coincide.

4.3.  A  renewal  estimate 

The (continuous-time) counting process $\{N(t),\ t \ge 0\}$ naturally associated with the sequence $\{\tau_n,\ n = 0, 1, \ldots\}$ is defined by

$$N(t) := \max\{k \ge 0 : \tau_k \le t\}, \quad t \ge 0 \tag{4.6}$$

with the ready interpretation that $N(t)$ represents the number of times the queue has returned to the empty state by time $t$. With this notation, the observation made earlier translates into

$$|X(n)| \le \tau_{N(n)+1} - n, \quad n = 0, 1, \ldots \tag{4.7}$$
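
The inequality (4.7) can be checked on simulated sample paths. The following is a minimal sketch assuming a hypothetical single-queue model (Bernoulli arrivals, at most one departure per slot) which is not the model (2.2) of this paper; the function names and parameter values are illustrative only. It verifies the slightly simpler statement that the queue size at time $n$ never exceeds the number of slots until the next epoch at which the queue is observed empty.

```python
import random


def simulate_queue(n_slots, p_arrival=0.3, p_service=0.6, x0=5, seed=1):
    """Simulate a toy queue where at most one customer leaves per slot."""
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(n_slots):
        arrivals = 1 if rng.random() < p_arrival else 0
        # A departure is possible only if a customer is present.
        departure = 1 if (x > 0 and rng.random() < p_service) else 0
        x = x + arrivals - departure
        path.append(x)
    return path


def bound_holds(path):
    """Check path[n] <= (next time m >= n with path[m] == 0) - n for all n.

    Times after the last observed empty epoch are skipped, since the next
    emptying time is not visible on a finite path.
    """
    next_empty = None
    for n in range(len(path) - 1, -1, -1):
        if path[n] == 0:
            next_empty = n
        if next_empty is not None and path[n] > next_empty - n:
            return False
    return True


if __name__ == "__main__":
    print(bound_holds(simulate_queue(5000)))
```

The check succeeds on any path whose total decreases by at most one unit per slot, which is the only property of the dynamics used in the argument.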

Now, for any monotone non-decreasing mapping $r : \mathbb{R}_+ \to \mathbb{R}_+$, set

$$R_G(t) := E^\pi_\Xi[r(\tau_{N(t)+1} - t)], \quad t \ge 0. \tag{4.8}$$

The subscripts $G$ and $\Xi$ in (4.8) emphasize the fact that the system is started with an initial queue size $\Xi$ distributed according to the distribution $q_\Xi(\cdot)$. Since the sequence $\{\theta_k,\ k = 2, 3, \ldots\}$ is a non-delayed renewal sequence, it is appropriate to define

$$R_F(t) := E^\pi_\Xi[r(\tau_{N(t+\tau_1)+1} - (t + \tau_1))], \quad t \ge 0 \tag{4.9}$$

as this corresponds to a non-delayed renewal process with $G = F$.

The first part of this section is devoted to the derivation of a bound on the expected values $\{R_G(t),\ t \ge 0\}$ for any non-idling policy $\pi$, with a view towards generating (via (4.7)) a bound for the sequence of expected values $\{E^\pi[r(|X(n)|)],\ n = 0, 1, \ldots\}$.

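Theorem 4.4. Assume (R1)-(R5) with $\rho < 1$ and let $\pi$ be an arbitrary non-idling policy in $\mathcal{P}$. Under the finite moment assumptions

$$m_G(r) := \int_0^\infty r(\theta)\,dG(\theta) < \infty \quad \text{and} \quad m_F(r) := \int_0^\infty r(\theta)\,dF(\theta) < \infty, \tag{4.10}$$

the condition

$$K_F(r) := \int_0^\infty \int_0^\theta r(\theta - t)\,dt\,dF(\theta) = \int_0^\infty \int_t^\infty r(\theta - t)\,dF(\theta)\,dt < \infty \tag{4.11}$$

implies

$$\sup_{t \ge 0} R_G(t) = \sup_{t \ge 0} E^\pi_\Xi[r(\tau_{N(t)+1} - t)] < \infty. \tag{4.12}$$

Proof: Let $r_G$ and $r_F$ be the mappings $\mathbb{R}_+ \to \mathbb{R}_+$ defined by

$$r_G(t) := \int_t^\infty r(\theta - t)\,dG(\theta) \quad \text{and} \quad r_F(t) := \int_t^\infty r(\theta - t)\,dF(\theta), \quad t \ge 0. \tag{4.13}$$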
The finiteness conditions (4.10) translate into $r_G(0) = m_G(r) < \infty$ and $r_F(0) = m_F(r) < \infty$. Since $r$ takes non-negative values and is monotone non-decreasing, the integrals entering the definition (4.13) are well defined, and satisfy the inequalities

$$\int_t^\infty r(\theta - t)\,dG(\theta) \le \int_t^\infty r(\theta - s)\,dG(\theta) \le \int_s^\infty r(\theta - s)\,dG(\theta) \tag{4.14}$$

whenever $0 \le s \le t$. As a result, the mapping $r_G$ is well defined and monotone non-increasing. Similar comments hold for $r_F$.

A standard renewal argument [11, pp. 183f.] applied to the process $\{r(\tau_{N(t)+1} - t),\ t \ge 0\}$ shows that

$$R_G(t) = \int_0^t R_F(t - \theta)\,dG(\theta) + \int_t^\infty r(\theta - t)\,dG(\theta), \quad t \ge 0 \tag{4.15}$$

whence

$$R_G(t) \le \int_0^t R_F(t - \theta)\,dG(\theta) + \int_0^\infty r(\theta)\,dG(\theta) \tag{4.16a}$$

$$\le \sup_{0 \le s \le t} R_F(s) + m_G(r), \quad t \ge 0 \tag{4.16b}$$

by the remarks made earlier. This clearly shows that under (4.10), the result (4.12) will hold if the bound

$$\sup_{t \ge 0} R_F(t) < \infty \tag{4.17}$$

can be established.

When $G = F$, the renewal equation (4.15) specializes to

$$R_F(t) = r_F(t) + \int_0^t R_F(t - \theta)\,dF(\theta), \quad t \ge 0. \tag{4.18}$$

Since the mapping $r_F$ is monotone non-increasing and takes non-negative values, it is integrable as a result of (4.11), whence directly Riemann integrable [11, pp. 190-191]. The fact that $0 \le r_F(t) \le m_F(r)$ for all $t \ge 0$ implies that $R_F$ is bounded on finite intervals [11, Thm. 4.2, p. 184]. Finally, note that the distribution $F(\cdot)$ has support on $\mathbb{N}$ and is therefore arithmetic, say with span $d$. All requisite conditions are now in place to apply the Basic Renewal Theorem [11, Thm. 5.5.1, p. 191] to the renewal equation (4.18) to obtain

$$\lim_n R_F(c + nd) = \frac{d}{m_F} \sum_{n=0}^{\infty} r_F(c + nd) \tag{4.19}$$
for all $c \ge 0$, where $m_F$ is the first moment of $F$ (which is finite by Proposition 4.3). Since the mapping $r_F$ is non-increasing, it readily follows from (4.19) that for all $c \ge 0$,

$$\lim_n R_F(c + nd) \le \frac{1}{m_F}\Big\{ d\,r_F(0) + \sum_{n=1}^{\infty} d\,r_F(nd) \Big\} \le \frac{1}{m_F}\Big\{ d\,r_F(0) + \int_0^\infty r_F(t)\,dt \Big\} = \frac{1}{m_F}\{ d\,m_F(r) + K_F(r) \} < \infty \tag{4.20}$$

where the finiteness of the last bound results from (4.10) and (4.11). In particular,

$$\lim_n R_F(nd + \ell) \le \frac{1}{m_F}\{ d\,m_F(r) + K_F(r) \} < \infty, \quad \ell = 1, 2, \ldots, d \tag{4.21}$$

and therefore

$$\sup_n R_F(n) < \infty. \tag{4.22}$$

Since $N(t)$ is constant on $[n, n+1)$, direct inspection of (4.9) shows that $R_F(t) \le R_F(n)$ whenever $n \le t < n+1$ owing to the monotonicity of $r$, and (4.17) is now immediate since $\sup_{t \ge 0} R_F(t) = \sup_n R_F(n)$. ■
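
As a numerical illustration of (4.18)-(4.19), the discrete renewal equation can be iterated directly. The sketch below assumes a hypothetical inter-renewal distribution on $\{1, 2, 3\}$ (so that the span is $d = 1$) and the mapping $r(x) = x$; neither is taken from this paper, and the function names are introduced only for this example. It computes $R_F(n)$ from the renewal equation and compares its value for large $n$ with the limit $\frac{1}{m_F}\sum_n r_F(n)$.

```python
def forward_recurrence_moments(f, r, horizon):
    """Iterate R(n) = r_F(n) + sum_{k=1}^{n} f(k) R(n - k) for n = 0..horizon.

    f maps each support point k >= 1 to its probability; r is the mapping
    applied to the forward recurrence time.
    """
    def r_F(n):
        # r_F(n) = sum over k > n of r(k - n) f(k)
        return sum(r(k - n) * p for k, p in f.items() if k > n)

    R = []
    for n in range(horizon + 1):
        conv = sum(p * R[n - k] for k, p in f.items() if k <= n)
        R.append(r_F(n) + conv)
    return R


def renewal_limit(f, r):
    """Limit (1 / mean) * sum_n r_F(n) for span d = 1 and offset c = 0."""
    mean = sum(k * p for k, p in f.items())
    support_max = max(f)
    total = sum(
        sum(r(k - n) * p for k, p in f.items() if k > n)
        for n in range(support_max)
    )
    return total / mean


if __name__ == "__main__":
    f = {1: 0.5, 2: 0.3, 3: 0.2}
    R = forward_recurrence_moments(f, lambda x: float(x), horizon=200)
    print(R[200], renewal_limit(f, lambda x: float(x)))
```

In this example the sequence $R_F(n)$ starts at the mean $m_F$ and converges geometrically to the limit, so it is bounded in $n$, in line with (4.22).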

Proof of Theorem 4.1: Start with the mapping $r$ given by $r(x) = x^{\gamma-1}$ for all $x \ge 0$, and observe that

$$K_F(r) = \int_0^\infty \int_0^\theta r(\theta - t)\,dt\,dF(\theta) = \frac{1}{\gamma} \int_0^\infty \theta^\gamma\,dF(\theta). \tag{4.23}$$

Under (R4), Proposition 4.3 and (4.23) imply the conditions (4.10) and (4.11), and a straightforward application of Theorem 4.4 yields (4.1). ■
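
The two steps behind this application can be spelled out as follows. This is a sketch of how the inner integral in (4.23) is evaluated and of how (4.7) combines with Theorem 4.4, assuming the initial condition is distributed as in (4.8); the first line holds for each fixed $\theta \ge 0$, and the second uses the monotonicity of $r(x) = x^{\gamma-1}$.

```latex
\begin{align*}
\int_0^{\theta} (\theta - t)^{\gamma-1}\,dt
  &= \Big[-\tfrac{1}{\gamma}(\theta - t)^{\gamma}\Big]_{t=0}^{t=\theta}
   = \frac{\theta^{\gamma}}{\gamma}, \\
E^{\pi}\big[|X(n)|^{\gamma-1}\big]
  &\le E^{\pi}\big[(\tau_{N(n)+1} - n)^{\gamma-1}\big]
   = R_G(n) \le \sup_{t \ge 0} R_G(t) < \infty .
\end{align*}
```

Since the distributions $G$ and $F$ do not depend on the non-idling policy in use, neither does the resulting bound.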

5.  ON  THE  POISSON  EQUATION 
5.1.  The  Poisson  equation 

Fix $\eta$ in the unit interval $[0, 1]$ and denote by $P^\eta$ (resp. $E^\eta$) the probability measure (resp. expectation operator) induced by the policy $f^\eta$. Moreover, let $P^\eta_x$ (resp. $E^\eta_x$) denote the (conditional) probability measure (resp. expectation operator) induced by the policy $f^\eta$ given that $X(0) = x$, with $x$ ranging in $\mathbb{N}^K$.

Recall that under $P^\eta$, the RVs $\{X(n),\ n = 0, 1, \ldots\}$ form a time-homogeneous Markov chain over $\mathbb{N}^K$, and let $(P^\eta(x, y))$ denote the corresponding one-step transition probabilities. It is plain from (3.3) that

$$P^\eta(x, y) = \eta P^1(x, y) + (1 - \eta) P^0(x, y) \tag{5.1}$$

for all $x$ and $y$ in $\mathbb{N}^K$ where $(P^1(x, y))$ (resp. $(P^0(x, y))$) are the one-step transition probabilities under $\overline{g}$ (resp. $\underline{g}$).

The mapping $h : \mathbb{N}^K \to \mathbb{R}$ and the scalar $J$ solve the Poisson equation (associated with the policy $f^\eta$) with forcing function $c : \mathbb{N}^K \to \mathbb{R}$ if

$$h(x) + J = c(x) + \sum_y P^\eta(x, y) h(y), \quad x \text{ in } \mathbb{N}^K. \tag{5.2}$$

Clearly the solution pair $(h, J)$ to (5.2) depends on $\eta$, and it is the purpose of this section to establish its regularity properties with respect to $\eta$. This information is essential both for establishing the validity of the Certainty Equivalence Principle [19, 24] and for studying the convergence of the Stochastic Approximation algorithm (3.6) by the method of Metivier and Priouret [20]. From now on, this dependence of $h(x)$ and $J$ on the bias $\eta$ is denoted simply by $h(\eta, x)$ and $J(\eta)$ for all $x$ in $\mathbb{N}^K$.

Define the first return time to state $x = 0$ as the $\mathcal{F}_n$-stopping time $T$ given by

$$T := \inf\{n > 0 : X(n) = 0\} \tag{5.3}$$

so that $T = \tau_1$ in the notation of Section 4. For each $\ell = 1, \ldots, \gamma$, set

$$T_\ell(x) := \mathcal{E}_x[T^\ell] = E^\eta_x[T^\ell], \quad x \text{ in } \mathbb{N}^K \tag{5.4}$$

where the notation that follows Proposition 4.3 has been used. For easy reference recall the estimate (4.5), valid under (R5)-(R6), i.e., for each $\ell = 1, \ldots, \gamma$, there exists a positive constant $C_\ell$ so that

$$T_\ell(x) \le C_\ell(1 + |x|^\ell), \quad x \text{ in } \mathbb{N}^K. \tag{5.5}$$

As pointed out already in Section 4, during each slot, at most one customer may leave the system, so that for each $t = 0, 1, \ldots$, $|X(t)|$ is necessarily no larger than the forward recurrence time (expressed in slots) to the empty state, and in particular $|X(0)| \le T$. Since the mapping $x \to \tilde{c}(|x|)$ defined in (R5) is a non-decreasing function of $|x|$, it is plain from (5.5) that whenever $\delta + 1 < \gamma$, the bounds

$$E^\eta_x\Big[\sum_{t=0}^{T-1} |c(X(t))|\Big] \le E^\eta_x\Big[\sum_{t=0}^{T-1} \tilde{c}(|X(t)|)\Big] \le \mathcal{E}_x[T\,\tilde{c}(T)] < \infty \tag{5.6}$$

hold, and the definition

$$C(\eta, x) := E^\eta_x\Big[\sum_{t=0}^{T-1} c(X(t))\Big], \quad x \text{ in } \mathbb{N}^K \tag{5.7}$$

is thus well posed. An explicit expression for a solution to the Poisson equation is available and is now given [8, 25].

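Theorem 5.1. Assume conditions (R1)-(R6) to hold with $\rho < 1$ and $\delta + 1 < \gamma$. A solution pair $(h(\eta, \cdot), J(\eta))$ to the Poisson equation (5.2) with $h(\eta, 0) = 0$ is given by

$$h(\eta, x) = C(\eta, x) - J(\eta)\,T_1(x), \qquad J(\eta) = \frac{C(\eta, 0)}{T_1(0)} \tag{5.8a}$$

for all $x$ in $\mathbb{N}^K$, and the equalities

$$J(f^\eta) = \lim_n E^\eta\Big[\frac{1}{n+1} \sum_{t=0}^{n} c(X(t))\Big] = J(\eta) \tag{5.8b}$$

hold true.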
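
A regenerative construction of this kind can be checked numerically on a small example. The sketch below assumes a hypothetical three-state Markov chain with state 0 playing the role of the empty state; the transition matrix `P`, the cost `c`, and all function names are illustrative and not taken from this paper. It computes the expected return time to 0 and the expected cost accumulated before the return, forms $J$ as their ratio at state 0 and $h$ as the accumulated cost minus $J$ times the expected return time, and then verifies that this pair solves the Poisson equation (5.2) with $h(0) = 0$.

```python
def first_passage(P, g, tol=1e-12, max_iter=100000):
    """Solve u(x) = g(x) + sum_{y != 0} P[x][y] u(y) by fixed-point iteration.

    u(x) is the expected sum of g(X(t)) over 0 <= t < T, where T is the first
    time n > 0 at which the chain is in state 0.
    """
    n = len(P)
    u = [0.0] * n
    for _ in range(max_iter):
        new = [g[x] + sum(P[x][y] * u[y] for y in range(1, n)) for x in range(n)]
        if max(abs(a - b) for a, b in zip(new, u)) < tol:
            return new
        u = new
    raise RuntimeError("fixed-point iteration did not converge")


def poisson_solution(P, c):
    """Return (h, J) built from the return time to state 0."""
    n = len(P)
    T1 = first_passage(P, [1.0] * n)  # expected return time to state 0
    C = first_passage(P, c)           # expected cost until the return to 0
    J = C[0] / T1[0]
    h = [C[x] - J * T1[x] for x in range(n)]
    return h, J


def poisson_residual(P, c, h, J):
    """Largest violation of h(x) + J = c(x) + sum_y P[x][y] h(y)."""
    n = len(P)
    return max(
        abs(h[x] + J - c[x] - sum(P[x][y] * h[y] for y in range(n)))
        for x in range(n)
    )


if __name__ == "__main__":
    P = [[0.5, 0.5, 0.0],
         [0.4, 0.3, 0.3],
         [0.0, 0.6, 0.4]]
    c = [0.0, 1.0, 4.0]
    h, J = poisson_solution(P, c)
    print(J, h, poisson_residual(P, c, h, J))
```

For this toy chain the value of $J$ coincides with the stationary average of the cost, as expected of the constant in a Poisson equation.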

5.2.  Lipschitz  continuity 

The representation (5.8) will be put to use in studying the regularity of the solution pair to the Poisson equation (5.2). To simplify the presentation of the main result of this section, set

$$K(x) := \mathcal{E}_x[T^2\,\tilde{c}(T)], \quad x \text{ in } \mathbb{N}^K. \tag{5.9}$$

Theorem 5.2. Assume (R1)-(R6) with $\rho < 1$ and $\delta + 2 < \gamma$. For all $x$ in $\mathbb{N}^K$, $K(x) < \infty$ and the function $\eta \to C(\eta, x)$ is Lipschitz continuous on $[0, 1]$ with Lipschitz constant $4K(x)$, i.e.,

$$|C(\eta, x) - C(\eta', x)| \le 4K(x)\,|\eta - \eta'| \tag{5.10}$$

for all $\eta$ and $\eta'$ in $[0, 1]$.

Proof: Fix $x$ in $\mathbb{N}^K$ throughout the discussion. That $K(x)$ and $\mathcal{E}_x[T\,\tilde{c}(T)]$ are both finite is plain from (5.5) under the assumption $\delta + 2 < \gamma$. Below the result (5.10) is established for $c$ non-negative in the form

$$|C(\eta, x) - C(\eta', x)| \le 2K(x)\,|\eta - \eta'| \tag{5.11}$$

for all $\eta$ and $\eta'$ in $[0, 1]$; the result for a general $c$ is then immediate upon decomposing $c$ into its positive and negative parts. Therefore, it suffices to assume $c$ to be non-negative in the remainder of this proof. The arguments proceed in three steps.
Step 1: Notice that for every $\mathbb{N}^K$-valued sequence $\{x(i),\ i = 0, 1, \ldots\}$ with $x(0) = x$, the relations

$$P^\eta_x[X(i) = x(i),\ 1 \le i \le m] = \prod_{i=0}^{m-1} P^\eta(x(i), x(i+1)), \quad m = 1, 2, \ldots \tag{5.12}$$

hold as a result of the Markov property of the chain $\{X(n),\ n = 0, 1, \ldots\}$ under $P^\eta$. The product form of (5.12) and the linear structure of (5.1) now imply that for each $m = 1, 2, \ldots$, the mapping $\eta \to P^\eta_x[X(i) = x(i),\ 1 \le i \le m]$ is a polynomial of degree at most $m$ in $\eta$ over $[0, 1]$ and has derivatives of all orders.

Set $A = [X(i) = x(i),\ 1 \le i \le m]$ in (5.12) and observe that

$$\frac{d}{d\eta} P^\eta_x[A] = \frac{d}{d\eta} \prod_{i=0}^{m-1} P^\eta(x(i), x(i+1)) = \sum_{t=0}^{m-1} \big( P^1(x(t), x(t+1)) - P^0(x(t), x(t+1)) \big) \prod_{i=0,\, i \ne t}^{m-1} P^\eta(x(i), x(i+1)). \tag{5.13}$$

This suggests defining for every $t = 0, 1, \ldots$, the policy $0_t$ (resp. $1_t$) as the Markov policy that operates according to $f^0$ (resp. $f^1$) at time $t$, and according to $f^\eta$ otherwise. With this notation, (5.13) now takes the form

$$\frac{d}{d\eta} P^\eta_x[X(i) = x(i),\ 1 \le i \le m] = \sum_{t=0}^{m-1} \big( P^{1_t}_x[A] - P^{0_t}_x[A] \big). \tag{5.14}$$

The definition of the policies $0_t$ and $1_t$ implies that $P^{1_t}_x[A] = P^{0_t}_x[A]$ whenever $m \le t$, so that (5.14) can also be rewritten as

$$\frac{d}{d\eta} P^\eta_x[X(i) = x(i),\ 1 \le i \le m] = \sum_{t=0}^{n-1} \big( P^{1_t}_x[A] - P^{0_t}_x[A] \big), \quad m \le n. \tag{5.15}$$
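
The differentiation rule behind (5.13) is easy to check numerically. The sketch below assumes two hypothetical $2 \times 2$ transition matrices `P1` and `P0` standing in for the kernels under the two base policies, and evaluates, for a fixed path, the sum over $t$ of the kernel difference at step $t$ times the product of the mixed kernel over the other steps; the result can then be compared against a central finite difference. All names and numbers are illustrative.

```python
def mixed(P1, P0, eta):
    """Kernel eta * P1 + (1 - eta) * P0, as in (5.1)."""
    return [[eta * a + (1.0 - eta) * b for a, b in zip(r1, r0)]
            for r1, r0 in zip(P1, P0)]


def path_probability(P, path):
    """Product of one-step transition probabilities along the path."""
    prob = 1.0
    for x, y in zip(path, path[1:]):
        prob *= P[x][y]
    return prob


def path_derivative(P1, P0, eta, path):
    """Derivative in eta of the path probability under the mixed kernel."""
    P = mixed(P1, P0, eta)
    steps = list(zip(path, path[1:]))
    total = 0.0
    for t, (x, y) in enumerate(steps):
        others = 1.0
        for i, (u, v) in enumerate(steps):
            if i != t:
                others *= P[u][v]
        total += (P1[x][y] - P0[x][y]) * others
    return total


if __name__ == "__main__":
    P1 = [[0.2, 0.8], [0.7, 0.3]]
    P0 = [[0.6, 0.4], [0.1, 0.9]]
    path, eta, eps = [0, 1, 1, 0], 0.3, 1e-6
    fd = (path_probability(mixed(P1, P0, eta + eps), path)
          - path_probability(mixed(P1, P0, eta - eps), path)) / (2 * eps)
    print(path_derivative(P1, P0, eta, path), fd)
```

Each summand replaces the mixed kernel at a single time $t$ by the difference of the two base kernels, which is what motivates the policies $0_t$ and $1_t$.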


Step 2: To proceed, define

$$C_m(\eta, x) := E^\eta_x\Big[ 1[T \le m] \sum_{t=0}^{T \wedge m - 1} c(X(t)) \Big] = \sum_{k=1}^{m} E^\eta_x\Big[ 1[T = k] \sum_{t=0}^{k-1} c(X(t)) \Big] \tag{5.16}$$

for all $m = 1, 2, \ldots$. The definition of $T$ implies that

$$[T = k] = [X(t) \ne 0,\ 0 < t < k,\ X(k) = 0], \quad k = 1, 2, \ldots \tag{5.17}$$

so that

$$C_m(\eta, x) = \sum_{k=1}^{m} \sum_{(x(1), \ldots, x(k)) \in Z_k} P^\eta_x[X(i) = x(i),\ 1 \le i \le k] \sum_{t=0}^{k-1} c(x(t)) \tag{5.18}$$

where the second sum is taken over the set $Z_k$ given by

$$Z_k := \{(x(1), x(2), \ldots, x(k)) \in (\mathbb{N}^K)^k : x(i) \ne 0,\ 1 \le i < k \text{ and } x(k) = 0\}, \quad k = 1, 2, \ldots \tag{5.19}$$

By arguments made earlier, it is plain that on the event $[T = k]$, the bounds $|X(t)| \le k$, $0 \le t \le k$, must necessarily hold, and therefore (5.18) reduces to

$$C_m(\eta, x) = \sum_{k=1}^{m} \sum_{(x(1), \ldots, x(k)) \in Z'_k} P^\eta_x[X(i) = x(i),\ 1 \le i \le k] \sum_{t=0}^{k-1} c(x(t)) \tag{5.20}$$

where the finite set $Z'_k$ is given by

$$Z'_k := \{(x(1), x(2), \ldots, x(k)) \in Z_k : |x(i)| \le k,\ 1 \le i \le k\}, \quad k = 1, 2, \ldots \tag{5.21}$$

Hence, in view of remarks made earlier in the proof, the mapping $\eta \to C_m(\eta, x)$ is a polynomial of degree at most $m$ in $\eta$ since it is the sum of a finite number of polynomial functions, each one of degree no greater than $m$.

Since $C_m(\eta, x)$ is a polynomial in $\eta$ for each $m = 1, 2, \ldots$, the derivative $\dot{C}_m(\eta, x)$ with respect to $\eta$ exists in the interval $[0, 1]$. To compute it, differentiate (5.20) and use (5.14)-(5.15) to conclude that

$$\dot{C}_m(\eta, x) = \sum_{t=0}^{m-1} \Big( E^{1_t}_x\Big[ \sum_{s=0}^{T \wedge m - 1} 1[T \le m]\, c(X(s)) \Big] - E^{0_t}_x\Big[ \sum_{s=0}^{T \wedge m - 1} 1[T \le m]\, c(X(s)) \Big] \Big). \tag{5.22}$$

The very same argument that led from (5.14) to (5.15) now implies that whenever $0 < k \le t$, the relation

$$E^{1_t}_x\Big[ 1[T = k] \sum_{s=0}^{k-1} c(X(s)) \Big] = E^{0_t}_x\Big[ 1[T = k] \sum_{s=0}^{k-1} c(X(s)) \Big] \tag{5.23}$$

holds. Therefore (5.22) can be rewritten (in the manner of (5.16)) as

$$\dot{C}_m(\eta, x) = \sum_{t=0}^{m-1} \Big( E^{1_t}_x\Big[ \sum_{s=0}^{T \wedge m - 1} 1[t < T \le m]\, c(X(s)) \Big] - E^{0_t}_x\Big[ \sum_{s=0}^{T \wedge m - 1} 1[t < T \le m]\, c(X(s)) \Big] \Big). \tag{5.24}$$

On the other hand,

$$\sum_{t=0}^{m-1} E^{1_t}_x\Big[ \sum_{s=0}^{T \wedge m - 1} 1[t < T \le m]\, c(X(s)) \Big] \le \sum_{t=0}^{m-1} E^{1_t}_x\Big[ 1[t < T] \sum_{s=0}^{T \wedge m - 1} 1[T \le m]\, \tilde{c}(T) \Big] \le \sum_{t=0}^{m-1} \mathcal{E}_x\big[ 1[t < T \le m]\, T\,\tilde{c}(T) \big] \le \mathcal{E}_x[T^2\,\tilde{c}(T)] \tag{5.25}$$

by elementary calculations. A similar bound holds for the terms corresponding to the policies $0_t$ in (5.24). It then follows from (5.24) and (5.25) that the derivative $\dot{C}_m(\eta, x)$ of $C_m(\eta, x)$ is bounded on $[0, 1]$ by $2K(x)$, and this uniformly in $m$, i.e.,

$$|\dot{C}_m(\eta, x)| \le 2K(x), \quad m = 0, 1, \ldots \tag{5.26}$$

for all $\eta$ in $[0, 1]$.

Step 3: The easy estimates

$$0 \le C(\eta, x) - C_m(\eta, x) = E^\eta_x\Big[ 1[T > m] \sum_{t=0}^{T-1} c(X(t)) \Big] \le \mathcal{E}_x\big[ 1[T > m]\, T\,\tilde{c}(T) \big], \quad m = 0, 1, \ldots \tag{5.27}$$

imply via the Monotone Convergence Theorem that $\lim_m C_m(\eta, x) = C(\eta, x)$ uniformly in $\eta$ since $\mathcal{E}_x[T\,\tilde{c}(T)] < \infty$. Consequently, with $0 \le \eta < \eta' \le 1$,

$$|C(\eta, x) - C(\eta', x)| = \lim_m |C_m(\eta, x) - C_m(\eta', x)| = \lim_m \Big| \int_\eta^{\eta'} \dot{C}_m(y, x)\,dy \Big| \le 2K(x)\,|\eta - \eta'| \tag{5.28}$$

upon making use of (5.26), and this establishes (5.11). ■

Note that the estimate (5.27) shows that $C(\eta, x)$ is continuous under the weaker condition $\delta + 1 < \gamma$.

5.3.  Corollaries 

Theorem 5.2 has several useful consequences which are given in the next few corollaries. The first such corollary is obtained by combining Theorems 5.1 and 5.2 in a straightforward manner; details are left to the interested reader.

Corollary 5.3. Under the hypotheses of Theorem 5.2, the functions $\eta \to J(\eta)$ and $\eta \to h(\eta, x)$, with $x$ ranging in $\mathbb{N}^K$, are Lipschitz continuous on $[0, 1]$, i.e., for all $\eta$ and $\eta'$ in $[0, 1]$,

$$|J(\eta) - J(\eta')| \le 4\,\frac{K(0)}{T_1(0)} \cdot |\eta - \eta'| \tag{5.29}$$

and

$$|h(\eta, x) - h(\eta', x)| \le 4K_h(x) \cdot |\eta - \eta'| \tag{5.30}$$

with

$$K_h(x) := K(x) + \frac{K(0)}{T_1(0)} \cdot T_1(x), \quad x \text{ in } \mathbb{N}^K. \tag{5.31}$$

The behavior of the Lipschitz constants $K(x)$ and $K_h(x)$, and of the solution $h(\eta, x)$ for $|x|$ large is needed in some of the arguments given in Section 6. The estimates on the Lipschitz constants are given first.

Corollary 5.4. Assume (R1)-(R6) with $\rho < 1$ and $\delta + 2 < \gamma$. There exists a positive constant $C$ such that

$$|K(x)| \le C(1 + |x|^{\delta+2}) \tag{5.32a}$$

and

$$|K_h(x)| \le C(1 + |x|^{\delta+2}) \tag{5.32b}$$

for all $x$ in $\mathbb{N}^K$.

Proof: Note from (R5) and (5.9) that

$$K(x) \le L\,\mathcal{E}_x[T^2(1 + T^\delta)] \le 2L\,\mathcal{E}_x[T^{\delta+2}] \le 2L C_{\delta+2}(1 + |x|^{\delta+2}) \tag{5.33}$$

with the last inequality following from (5.5), so that (5.32a) holds whenever $C \ge 2LC_{\delta+2}$. The inequality (5.32b) is readily obtained from (5.31) upon making use of (5.5) and (5.32a). ■

The growth of solutions to the Poisson equation can now be described.

Corollary 5.5. Assume (R1)-(R6) with $\rho < 1$ and $\delta + 1 < \gamma$. There exists a positive constant $B_h$ such that for every $\eta$ in $[0, 1]$,

$$|h(\eta, x)| \le B_h(1 + |x|^{\delta+1}) \tag{5.34}$$

as $x$ ranges over $\mathbb{N}^K$.
Proof: By the remark following the proof of Theorem 5.2, the mapping $\eta \to C(\eta, 0)$ is continuous on $[0, 1]$ and therefore bounded there. By straightforward arguments,

$$|h(\eta, x)| \le E^\eta_x\Big[ \sum_{t=0}^{T-1} |c(X(t))| \Big] + \frac{T_1(x)}{T_1(0)} \sup_{0 \le \eta \le 1} |C(\eta, 0)| \le \mathcal{E}_x[T\,\tilde{c}(T)] + B_1 T_1(x) \tag{5.35}$$

for all $x$ in $\mathbb{N}^K$, with

$$B_1 := \frac{1}{T_1(0)} \cdot \sup_{0 \le \eta \le 1} |C(\eta, 0)|. \tag{5.36}$$

The passage from (5.35) to (5.34) is validated by the same arguments as the ones given in the proof of Corollary 5.4. ■

Finally, this section concludes with a bound on the moments of the RVs $\{h(\eta(n), X(n+1)),\ n = 0, 1, \ldots\}$.

Corollary 5.6. Assume (R1)-(R6) with $\rho < 1$ and $r(\delta + 1) + 1 < \gamma$ for some integer $r = 0, 1, \ldots$. There exists a positive constant $H_r$ such that the bound

$$\sup_n E^\alpha[|h(\eta(n), X(n+1))|^r] \le H_r \tag{5.37}$$

holds.

Proof: Corollary 5.5 immediately implies

$$|h(\eta, x)|^r \le |2B_h|^r (1 + |x|^{r(\delta+1)}), \quad x \text{ in } \mathbb{N}^K \tag{5.38}$$

for every $\eta$ in $[0, 1]$, so that

$$E^\alpha[|h(\eta(n), X(n+1))|^r] \le |2B_h|^r \big(1 + E^\alpha[|X(n+1)|^{r(\delta+1)}]\big), \quad n = 0, 1, \ldots \tag{5.39}$$

The conclusion (5.37) is now obtained from Theorem 4.1 upon selecting $H_r = |2B_h|^r(1 + K_\gamma)$ since $r(\delta + 1) < \gamma - 1$. ■

6.  CONVERGENCE  OF  THE  STOCHASTIC  APPROXIMATIONS 
6.1.  The  ODE  method 

This section is devoted to proving the convergence of the recursive scheme (3.5)-(3.6) when the policy $\alpha$ is in use. For the sake of concreteness, the mapping $\eta \to J(f^\eta)$ is assumed monotone increasing throughout the discussion. This is done with the understanding that were the mapping $\eta \to J(f^\eta)$ monotone decreasing, the recursion (3.6) would have to be changed accordingly and the technical conditions modified in an obvious manner. Details are left to the interested reader.

The following additional assumption (R7) is imposed in order to carry out the analysis.

(R7) The equation

$$J(f^\eta) = V, \quad 0 \le \eta \le 1 \tag{6.1}$$

has a unique solution $\eta^*$, and for some $\varepsilon > 0$,

$$[J(f^\eta) - V](\eta - \eta^*) > 0 \tag{6.2}$$

whenever $\eta \ne \eta^*$ and $|\eta - \eta^*| < \varepsilon$ in $[0, 1]$.

The condition (6.2) is tantamount to local monotonicity and, in practice, is often verified by establishing some stronger monotonicity property on $\eta \to J(f^\eta)$ such as (R7bis) below.

(R7bis) The mapping $[0, 1] \to \mathbb{R} : \eta \to J(f^\eta)$ is strictly monotone, say monotone increasing for the sake of definiteness.

In Section 7, condition (R7bis) is shown to hold for a steering problem which arises from a constrained optimization problem.

The proof of Theorem 3.1 given below uses a version of the ODE method which was proposed by Metivier and Priouret in [20]. The arguments combine the deterministic lemma of Kushner and Clark [14] with a probabilistic result based on properties of the Poisson equation (5.2). This key result is given in the next theorem, the proof of which is delayed till the second part of the section. To state the result, consider the RVs $\{Y(n),\ n = 0, 1, \ldots\}$ given by

$$Y(n) := J(f^{\eta(n)}) - c(X(n+1)), \quad n = 0, 1, \ldots \tag{6.3}$$

and for every $t > 0$, set

$$m(n, t) := \max\Big\{k \ge n : \sum_{i=n}^{k-1} a_i \le t\Big\}, \quad n = 0, 1, \ldots \tag{6.4}$$
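
The index $m(n, t)$ can be computed by accumulating step sizes until the budget $t$ is exhausted. The sketch below assumes the illustrative choice $a_i = 1/(i+1)$, $i = 0, 1, \ldots$, which is positive, decreasing, non-summable and square-summable; the function names `a` and `m` are hypothetical and introduced only for this example.

```python
def a(i: int) -> float:
    """Illustrative step size a_i = 1 / (i + 1), i >= 0."""
    return 1.0 / (i + 1)


def m(n: int, t: float) -> int:
    """Largest k >= n with a_n + ... + a_{k-1} <= t (empty sum when k == n)."""
    k = n
    total = 0.0
    # The loop terminates because the step sizes are not summable.
    while total + a(k) <= t:
        total += a(k)
        k += 1
    return k


if __name__ == "__main__":
    print(m(0, 1.0), m(0, 2.0), m(5, 0.1))
```

Because the step sizes decrease to zero, the window length $m(n, t) - n$ grows with $n$ for fixed $t$, so the supremum in the convergence statement below ranges over longer and longer blocks of indices.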


Theorem 6.1. Assume (R1)-(R6) with $\rho < 1$ and $2\delta + 3 < \gamma$. For each $t > 0$ the convergence

$$\lim_n \Big( \sup_{n \le k \le m(n,t)} \Big| \sum_{i=n}^{k} a_i Y(i) \Big| \Big) = 0 \quad P^\alpha\text{-a.s.} \tag{6.5}$$

takes place.

Proof of Theorem 3.1. As shown in [14, 20], the convergence (6.5) underlies the $P^\alpha$-a.s. convergence of the estimates $\{\eta(n),\ n = 0, 1, \ldots\}$ to $\eta^*$. The reader is invited to consult [14, 20] for a complete exposition of the arguments which are now briefly summarized: Interpolate the estimate sequence $\{\eta(n),\ n = 0, 1, \ldots\}$ by a piecewise linear function $\eta^0 : [0, \infty) \to \mathbb{R}$ such that $\eta^0(t_n) = \eta(n)$ at time $t_n = \sum_{i=0}^{n-1} a_i$ for all $n = 0, 1, \ldots$. Moreover, define a sequence of left shifts $\{\eta^{(n)}(\cdot),\ n = 0, 1, \ldots\}$, i.e., $\eta^{(n)}(t) = \eta^0(t + t_n)$ for all $t \ge 0$, in order to bring the "asymptotic part" of $\{\eta(n),\ n = 0, 1, \ldots\}$ back to a neighborhood of the time origin.

Now observe that the recursion (3.6) can be written in the form

$$\eta(n+1) = \Big[ \eta(n) + a_{n+1}\big[ (V - J(f^{\eta(n)})) + Y(n) \big] \Big]_0^1, \quad n = 0, 1, \ldots \tag{6.6}$$

and that from any subsequence of $\{\eta^{(n)}(\cdot),\ n = 0, 1, \ldots\}$ a further convergent subsequence $\{\eta^{(m_p)}(\cdot),\ p = 0, 1, \ldots\}$ can then be extracted by standard boundedness and equicontinuity arguments. It is then easy to see from Theorem 6.1 that its limit $\eta(\cdot)$, and for that matter the limit of any convergent subsequence, satisfies the ODE

$$\dot{\eta}(t) = V - J(f^{\eta(t)}), \quad t \ge 0, \quad \eta(0) \text{ in } [0, 1]. \tag{6.7}$$

Owing to (R7), this ODE is asymptotically stable with a unique stable point $\eta^*$ in $[0, 1]$. A simple shifting argument now implies $\eta(t) = \eta^*$ for all $t \ge 0$ and this completes the proof. These arguments are standard and are therefore omitted here in the interest of brevity. ■

The  remainder  of  this  section  is  devoted  to  a  proof  of  (6.5). 

6.2.  A  proof  of  Theorem  6.1 

The Poisson equation (5.2) implies the relations

$$E^\eta[h(\eta, X(n+1)) \mid \mathcal{F}_n] = h(\eta, X(n)) + J(\eta) - c(X(n)), \quad n = 0, 1, \ldots \tag{6.8}$$

for all $0 \le \eta \le 1$. It then follows from (5.8b) and (6.3) that

$$-Y(n) = c(X(n+1)) - J(\eta(n)) = h(\eta(n), X(n+1)) - E^{\eta(n)}[h(\eta(n), X(n+2)) \mid \mathcal{F}_{n+1}] = Z_n^{(1)} + Z_n^{(2)} + Z_n^{(3)}, \quad n = 0, 1, \ldots \tag{6.9}$$

with

$$Z_n^{(1)} := h(\eta(n), X(n+1)) - E^{\eta(n)}[h(\eta(n), X(n+1)) \mid \mathcal{F}_n], \tag{6.10a}$$

$$Z_n^{(2)} := E^{\eta(n)}[h(\eta(n), X(n+1)) \mid \mathcal{F}_n] - E^{\eta(n+1)}[h(\eta(n+1), X(n+2)) \mid \mathcal{F}_{n+1}] \tag{6.10b}$$

and

$$Z_n^{(3)} := E^{\eta(n+1)}[h(\eta(n+1), X(n+2)) \mid \mathcal{F}_{n+1}] - E^{\eta(n)}[h(\eta(n), X(n+2)) \mid \mathcal{F}_{n+1}] \tag{6.10c}$$

for all $n = 0, 1, \ldots$. It now suffices to show that

$$\lim_n \Big( \sup_{n \le \ell \le m(n,t)} \Big| \sum_{j=n}^{\ell} a_j Z_j^{(k)} \Big| \Big) = 0 \quad P^\alpha\text{-a.s.} \tag{6.11}$$

for all $t > 0$ and all $k = 1, 2, 3$.

This will be done in three steps. To facilitate the presentation, define the RVs $\{S_n^{(k)},\ n = 0, 1, \ldots\}$ by

$$S_n^{(k)} := \sum_{i=0}^{n-1} a_i Z_i^{(k)}, \quad n = 1, 2, \ldots \tag{6.12}$$

for $k = 1, 2, 3$, with $S_0^{(1)} = S_0^{(2)} = S_0^{(3)} = 0$.

Step 1: The RVs $\{Z_n^{(1)},\ n = 0, 1, \ldots\}$ form a $(P^\alpha, \mathcal{F}_n)$ martingale-difference sequence, whence $\{S_n^{(1)},\ n = 0, 1, \ldots\}$ is a zero mean $(P^\alpha, \mathcal{F}_n)$-martingale. Routine calculations show that

$$\sup_n E^\alpha[|S_n^{(1)}|^2] = \sup_n E^\alpha\Big[ \sum_{i=0}^{n-1} a_i^2 |Z_i^{(1)}|^2 \Big] \tag{6.13}$$

$$\le \sup_n E^\alpha\big[ |h(\eta(n), X(n+1))|^2 \big] \cdot 4 \sum_{i=0}^{\infty} a_i^2 \tag{6.14}$$

$$\le 4H_2 \cdot \sum_{i=0}^{\infty} a_i^2. \tag{6.15}$$

The passage from (6.14) to (6.15) uses the estimate (5.37) given in Corollary 5.6 (with $r = 2$ since $2\delta + 2 < \gamma - 1$). It is plain from (3.7) that the left-hand side of (6.13) is finite, and the $(P^\alpha, \mathcal{F}_n)$-martingale $\{S_n^{(1)},\ n = 0, 1, \ldots\}$ is thus uniformly integrable under $P^\alpha$. By the Martingale Convergence Theorem, the RVs $\{S_n^{(1)},\ n = 0, 1, \ldots\}$ converge a.s. under $P^\alpha$ (to an a.s. finite limit), in which case they form a Cauchy sequence $P^\alpha$-a.s. and (6.11) follows for $k = 1$.




Step  2:  For  k  =  2,  note  first  the  relations 

=  E»<z‘2) 

i=n 

i 

=  -  £>;_!  -  a^E^ihirjii),  X(i  +  1))  |  ft]  (6.16) 

i~n 

+  an-1E^[h(r}(n),X(n+  1))  |  JF„]  -  aeE^+^[h(V(£  +  1),X(£+  2))  |  Fw] 
valid  for  all  0  <  n  <  l.  Define  the  RVs  {lifn,  n  =  0, 1, . . .}  by 

Kn  :=  E^[h(rj(n),  X(n  +  1))  |  Tn]  n  =  0, 1, . . .  (6.17) 

and  set 

Br  =  supn  Ea  [|  Kn  |r] .  r  =  1,2,...  (6.18) 

It  is  clear  from  (5.37)  (with  r  =  1,2)  and  Jensen’s  inequality  that  B\  <  II\  <  oo  and  B^  <  II 2  <  00. 
With  this  notation,  (6.16)  can  be  rewritten  as 

1 

I  sgi  -  SW  \<  an^\Kn\  +  ][>,_!  -  aOl^il  +  ae\Kt+i\  (6.19) 

t=n 

for  all  0  <  n  <  t  since  an  [  0.  Upon  defining  the  RVs  {5„,  n—  1,2...}  and  {Rn,  n  =  0, 1, . . .}  by 

n 

Sn  =  £)(ai_i  -  ai)|jSTt|  n  =  1,2,... (6.20) 

1=1 

and 

n 

Rn  =  ^2  l°*|2|^+i|2*  n  =  0, 1, . .  .(6.21) 

t=0 

(6.19)  now  becomes 

I  Sgi  -  SW  |<  an^\Kn\+  |  S*  -  5n+1  |  +o<|Jir/+i|  (6.22) 

for  all  0  <  n  <  l. 

The  definition  (6.20)  implies 

n 

Ea[Sn]<B-L^2  (a,— i  -  d{)  =  Bi(a0  -  an)  <  Bia0.  n  =  0,1,.  ..(6.23) 

i=0 
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Since  Sn  <  Sn+ 1,  the  limit  5^  :=  lim„5n  exists  and  therefore  £'a[5O0]  <  B\(iq  by  using  the 
Monotone  Convergence  Theorem  on  (6.23).  Consequently,  Soo  <  oo  Pa- a.s.  and  the  RVs  {Sn,  n  = 
0, 1, . . .}  form  a  Cauchy  sequence  Pa- a.s.,  i.e., 


lim„  sup*,>n  |  St  -  5„+ 1  |=  0  Pa  -  a.s.  (6.24) 

To handle the first and last terms of (6.22), observe that $R_n \le R_{n+1}$, hence the limit $R_\infty := \lim_n R_n$ exists and satisfies

$$E^\alpha[R_\infty] \le B_2 \sum_{i=0}^{\infty} a_i^2 < \infty \tag{6.25}$$

by virtue of the Monotone Convergence Theorem. Consequently, $\lim_n R_n = R_\infty < \infty$ $P^\alpha$-a.s., whence $\lim_n a_{n-1}|K_n| = 0$ $P^\alpha$-a.s. or equivalently

$$\lim_n \sup_{\ell > n} a_\ell |K_{\ell+1}| = 0 \quad P^\alpha\text{-a.s.} \tag{6.26}$$

by the Cauchy convergence criterion. Making use of (6.24) and (6.26) readily leads (via (6.22)) to the conclusion (6.11) for k = 2.
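
The identity (6.16) is a summation-by-parts rearrangement, and the bound (6.19) follows from it by the triangle inequality. The Python sketch below checks both numerically on arbitrary data. It assumes that $Z_i^{(2)} = K_i - K_{i+1}$, which is what the right-hand side of (6.16) suggests but is not restated in this section; the step sizes `a` and the values `K` are made-up inputs, not quantities from the model.

```python
import random

def weighted_sum(a, K, n, l):
    """Left-hand side: sum over i = n..l of a_i * (K_i - K_{i+1})."""
    return sum(a[i] * (K[i] - K[i + 1]) for i in range(n, l + 1))

def by_parts(a, K, n, l):
    """Right-hand side of the summation-by-parts identity."""
    middle = -sum((a[i - 1] - a[i]) * K[i] for i in range(n, l + 1))
    return middle + a[n - 1] * K[n] - a[l] * K[l + 1]

def upper_bound(a, K, n, l):
    """Triangle-inequality bound; relies on a being non-increasing."""
    middle = sum((a[i - 1] - a[i]) * abs(K[i]) for i in range(n, l + 1))
    return a[n - 1] * abs(K[n]) + middle + a[l] * abs(K[l + 1])

rng = random.Random(0)
a = [1.0 / (i + 1) for i in range(60)]            # hypothetical decreasing step sizes
K = [rng.uniform(-5.0, 5.0) for _ in range(61)]   # arbitrary stand-ins for K_n

pairs = [(n, l) for n in range(1, 30) for l in range(n + 1, 59)]
max_gap = max(abs(weighted_sum(a, K, n, l) - by_parts(a, K, n, l)) for n, l in pairs)
bound_ok = all(
    abs(weighted_sum(a, K, n, l)) <= upper_bound(a, K, n, l) + 1e-12 for n, l in pairs
)
```

The monotonicity of the step sizes matters only for the bound: it makes every weight $a_{i-1} - a_i$ non-negative, so that absolute values can be moved inside the middle sum.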

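Step 3: For k = 3, observe that (6.8) and the estimates of Corollary 5.3 readily yield the estimates

$$\begin{aligned}
\bigl| E^{\eta}[h(\eta, X(n+1)) \mid \mathcal{F}_n] - E^{\tilde\eta}[h(\tilde\eta, X(n+1)) \mid \mathcal{F}_n] \bigr|
&= \bigl| h(\eta, X(n)) - h(\tilde\eta, X(n)) + J(\eta) - J(\tilde\eta) \bigr| \\
&\le 4\tilde{K}(X(n)) \cdot |\eta - \tilde\eta|, \quad n = 0, 1, \ldots
\end{aligned} \tag{6.27}$$

for all $\eta$ and $\tilde\eta$ in [0,1], where

$$\tilde{K}(x) := K(x) + 2\mathcal{H}(x), \quad x \in \mathbb{N}^K. \tag{6.28}$$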

The recursion (3.6) implies

$$|\eta(n+1) - \eta(n)| \le a_{n+1} |V - c(X(n+1))|, \quad n = 0, 1, \ldots \tag{6.29}$$

and the inequality

$$|Z_n^{(3)}| \le 4 a_{n+1} Q(X(n+1)), \quad n = 0, 1, \ldots \tag{6.30}$$

is now obtained from (6.27), upon setting

$$Q(x) := \tilde{K}(x)\bigl(V + |c(x)|\bigr), \quad x \in \mathbb{N}^K. \tag{6.31}$$
Under (R5), with the help of (5.5) and (5.32a), it is a simple exercise to check that

$$Q(x) \le C\bigl(1 + |x|^{2\delta+2}\bigr), \quad x \in \mathbb{N}^K \tag{6.32}$$

for some positive constant C. Consequently,

$$E^\alpha\Bigl[\sum_{i=0}^{n} a_i |Z_i^{(3)}|\Bigr] \le C \sum_{i=0}^{n} a_i^2\, E^\alpha\bigl[1 + |X(i+1)|^{2\delta+2}\bigr] \tag{6.33}$$

$$\le C\Bigl(1 + \sup_i E^\alpha\bigl[|X(i+1)|^{2\delta+2}\bigr]\Bigr) \sum_{i=0}^{\infty} a_i^2 \tag{6.34}$$

for all n = 0, 1, ..., where the passage from (6.33) to (6.34) is a simple consequence of (4.1) (since $2\delta + 2 < \gamma - 1$), which makes the supremum finite. Now, in exactly the same way as in Step 2 of the proof, this uniform bound (6.34) implies

$$\lim_n \sup_{\ell > n} \sum_{i=n}^{\ell} a_i |Z_i^{(3)}| = 0 \quad P^\alpha\text{-a.s.} \tag{6.35}$$

and (6.11) obviously holds for k = 3. ■
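
The passage from (6.27) and (6.29) to (6.30) in Step 3 can be spelled out as a short chain. This is a sketch under two assumptions that are not restated in this section: that $Z_n^{(3)}$ is the difference of the two conditional expectations in (6.27), taken at time n+1 with the parameters $\eta(n+1)$ and $\eta(n)$, and that $V \ge 0$. Here $\tilde{K}$ denotes the function defined in (6.28).

```latex
\begin{aligned}
|Z_n^{(3)}|
  &\le 4\,\tilde{K}(X(n+1))\,|\eta(n+1) - \eta(n)|
      && \text{by (6.27)} \\
  &\le 4\,a_{n+1}\,\tilde{K}(X(n+1))\,|V - c(X(n+1))|
      && \text{by (6.29)} \\
  &\le 4\,a_{n+1}\,\tilde{K}(X(n+1))\,\bigl(V + |c(X(n+1))|\bigr)
      && \text{triangle inequality} \\
  &= 4\,a_{n+1}\,Q(X(n+1))
      && \text{by (6.31)}
\end{aligned}
```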

7.  CONVERGENCE  OF  THE  ADAPTIVE  POLICY  AND  APPLICATIONS 


This  final  section  contains  a  proof  of  Theorem  3.2,  as  well  as  the  discussion  of  an  application  that 
arises  in  constrained  optimization. 

7.1.  A  proof  of  Theorem  3.2. 

The proof follows from the general results obtained by the authors [24] on the Certainty Equivalence Principle when specialized to “simply randomized” policies. First note that the (assumed) convergence $\lim_n \eta(n) = \eta^*$ in probability under $P^\alpha$, when combined with Theorem 7.2 of [24], implies the key convergence condition (C) [Ibid., Section 4]. Consequently, the convergence (3.11)-(3.12) follows from Theorem 3.1bis in [Ibid.] provided the hypotheses of Theorems 4.2 and 5.1bis of [Ibid.] are satisfied. These hypotheses consist of the tightness of the RVs $\{X(t),\ t = 0, 1, \ldots\}$ under $P^\alpha$ and of bounds on the moments of the RVs $\{c(X(t)), h(\eta^*, X(t)),\ t = 0, 1, \ldots\}$ under various policies. It is easy to check that these conditions are all implied by the following condition:

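There exist $\varepsilon > 0$ and a positive constant $C_\varepsilon$ such that for every non-idling policy $\pi$ in $\mathcal{P}$, the bounds

$$\sup_t E^\pi\bigl[|X(t)|^{1+\varepsilon}\bigr] \le C_\varepsilon, \tag{7.1}$$

$$\sup_t E^\pi\bigl[|c(X(t))|^{1+\varepsilon}\bigr] \le C_\varepsilon \tag{7.2}$$

and

$$\sup_t E^\pi\bigl[|h(\eta^*, X(t))|^{1+\varepsilon}\bigr] \le C_\varepsilon \tag{7.3}$$

hold.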

Observe that by virtue of Theorem 4.1, the bound (7.1) readily holds whenever $1 + \varepsilon < \gamma - 1$. By assumption, c is of polynomial growth with rate $\delta$, so that (7.2) holds if $\delta(1 + \varepsilon) < \gamma - 1$ by the remark made earlier. To obtain the third bound (7.3), observe from (5.34) that for every $\varepsilon > 0$,

$$|h(\eta^*, X(n))|^{1+\varepsilon} \le |2B_h|^{1+\varepsilon} \bigl(1 + |X(n)|^{(\delta+1)(\varepsilon+1)}\bigr), \quad n = 0, 1, \ldots \tag{7.4}$$

and (7.3) follows with $(1 + \varepsilon)(1 + \delta) < \gamma - 1$ by again making use of Theorem 4.1. Consequently (7.1)-(7.3) will hold provided $\varepsilon$ is chosen positive such that $1 + (1 + \delta)(1 + \varepsilon) < \gamma$.
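
As a worked instance of the three moment conditions, take $\delta = 1$ (linear cost) and, purely for illustration, $\gamma = 6$; the value of $\gamma$ is an assumed number, not one taken from the model.

```latex
\begin{aligned}
\text{(7.1):}\quad & 1 + \varepsilon < \gamma - 1 = 5
    &&\Longleftrightarrow\quad \varepsilon < 4, \\
\text{(7.2):}\quad & \delta(1 + \varepsilon) = 1 + \varepsilon < 5
    &&\Longleftrightarrow\quad \varepsilon < 4, \\
\text{(7.3):}\quad & (1 + \varepsilon)(1 + \delta) = 2(1 + \varepsilon) < 5
    &&\Longleftrightarrow\quad \varepsilon < \tfrac{3}{2}.
\end{aligned}
```

The third condition is the binding one, and it coincides with the combined requirement $1 + (1 + \delta)(1 + \varepsilon) < \gamma$, which here reads $1 + 2(1 + \varepsilon) < 6$. In general a positive $\varepsilon$ exists exactly when $\gamma > 2 + \delta$, and any $\varepsilon < (\gamma - 1)/(1 + \delta) - 1$ will do; for example $\varepsilon = 1$ in the case above.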

An  identical  analysis  applies  for  the  long-run  average  cost  associated  with  d;  details  are  left  to 
the  interested  reader.  ■ 

7.2. An application to constrained optimization

Consider  the  following  situation  discussed  by  Nain  and  Ross  in  [21].  Several  types  of  traffic, 
say  voice,  video  and  data,  compete  for  the  use  of  a  single  resource  (or  server).  The  performance 
requirements  for  this  system  are  defined  by  the  minimization  of  a  weighted  average  of  the  number  of 
video  and  data  packets  subject  to  the  constraint  that  the  average  number  of  voice  packets  waiting 
for  service  does  not  exceed  V.  This  situation  can  be  modelled  by  a  system  of  K  competing  queues 
with  P  =  0.  For  a  precise  definition  of  the  performance  measures,  set 

$$c(x) := x_K \quad \text{and} \quad d(x) := \sum_{k=1}^{K-1} d_k x_k \tag{7.5}$$

for all x in $\mathbb{N}^K$, where $d_1, \ldots, d_{K-1}$ are positive constants. Denote by $J_c(\pi)$ (resp. $J_d(\pi)$) the long-run average cost (3.1) associated with the cost c (resp. d) when using the policy $\pi$ in $\mathcal{P}$. The constrained optimization problem $(P_V)$ is then formulated as

$$(P_V): \quad \text{Minimize } J_d(\cdot) \text{ over } \mathcal{P}_V \tag{7.6}$$

where $\mathcal{P}_V := \{\pi \in \mathcal{P} : J_c(\pi) \le V\}$.

Assume the problem to be feasible and non-trivial, i.e., $\mathcal{P}_V$ is non-empty and the policies which minimize $J_d$ are not in $\mathcal{P}_V$. In that case, Nain and Ross [21] showed that there exist two strict priority policies $\overline{g}$ and $\underline{g}$ and a bias $\eta^*$ satisfying the equation

$$J_c(f^{\eta}) = V, \quad \eta \in [0,1] \tag{7.7}$$

such that $f^{\eta^*}$ defined through (3.3) is optimal. While the policies $\overline{g}$ and $\underline{g}$ can be found explicitly, the determination of $\eta^*$ is a difficult task since for $0 < \eta < 1$ the evaluation of $J_c(f^{\eta})$ requires solving a Riemann-Hilbert problem. That this computational difficulty can be circumvented by making use of a Stochastic Approximation-based policy is the content of the following.

Theorem 7.1 Assume (R1)-(R5) with $\rho < 1$ and $\gamma > 5$. The scheme (3.5)-(3.6) solves the constrained optimization problem $(P_V)$ provided it is feasible.

Proof: Nain and Ross [21, Thm. 3.1, pp. 885-886] showed that if the problem is feasible and non-trivial, then there exist Markov stationary policies $\overline{g}$ and $\underline{g}$ such that (7.7) has at least one solution. In fact, both policies are fixed priority policies with $\underline{g}$ giving highest priority to queue K, and $\overline{g}$ giving lowest priority to queue K, while the relative priorities of the other queues are otherwise identical. Moreover, the mapping $\eta \to J_c(f^{\eta})$ is monotone non-decreasing. It is shown in Lemma 7.2 below that this mapping is in fact strictly monotone increasing. When $\gamma > 5$, the conditions of Theorems 3.1 and 3.2 are readily verified with $\delta = 1$. Hence, $\lim_n \eta(n) = \eta^*$ $P^\alpha$-a.s. so that $J_c(\alpha) = J_c(f^{\eta^*}) = V$ and $J_d(\alpha) = J_d(f^{\eta^*})$, i.e., $\alpha$ is a policy in $\mathcal{P}_V$ and is thus also constrained optimal.

If the problem is trivial, i.e., $J_c(\overline{g}) \le V$, then $\overline{g}$ solves $(P_V)$. In that case, the same arguments imply that $\lim_n \eta(n) = 1$ $P^\alpha$-a.s., and optimality follows. ■

In the case K = 2, the two policies $\overline{g}$ and $\underline{g}$ are necessarily the fixed priority rules for queue 1 and 2, respectively, and the adaptive policy does not assume any prior information on the statistics of the system, provided (R1)-(R5) hold with $\gamma > 5$. For K = 2, the optimality was obtained by Shwartz and Makowski [23] under a slightly weaker assumption (namely $\gamma > 3$), but the convergence (3.10) was only in probability.
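
A minimal simulation sketch of the adaptive scheme for K = 2 follows. Recursion (3.6) is not restated in this section, so the update used below is an assumption that is merely consistent with (6.29): a step $\eta(n) + a_{n+1}(V - c(X(n+1)))$ clipped to [0,1], with $c(x) = x_2$. The Bernoulli arrivals, Bernoulli service completions, the step sizes $1/(n+2)$, the function name `run_adaptive` and all numerical parameters are illustrative choices, not taken from the paper.

```python
import random

def run_adaptive(V, steps, seed=0, p_arr=(0.2, 0.2), p_srv=(0.7, 0.7)):
    """Simulate two queues sharing one server under a randomized priority rule.

    With probability eta the server gives priority to queue 1 (index 0),
    otherwise to queue 2 (index 1). Returns the sequence of bias values eta.
    """
    rng = random.Random(seed)
    x = [0, 0]                      # queue lengths
    eta = 0.5                       # initial bias (arbitrary)
    etas = [eta]
    for n in range(steps):
        # Randomize between the two fixed priority rules.
        order = (0, 1) if rng.random() < eta else (1, 0)
        # Non-idling: serve the first non-empty queue in priority order.
        served = next((k for k in order if x[k] > 0), None)
        if served is not None and rng.random() < p_srv[served]:
            x[served] -= 1
        # Arrivals of this slot join after the service attempt.
        for k in range(2):
            if rng.random() < p_arr[k]:
                x[k] += 1
        # Assumed projected update driving the average of x[1] toward V.
        step = 1.0 / (n + 2)
        eta = min(1.0, max(0.0, eta + step * (V - x[1])))
        etas.append(eta)
    return etas

etas = run_adaptive(V=0.3, steps=2000)
```

The sign of the update reflects the monotonicity used in the proof: when the constrained queue is shorter than the target V, the bias moves toward the rule that favours queue 1, which lets queue 2 grow. The sketch makes no claim about the limiting value of the bias for these parameters.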

This  section  concludes  with  the  following  monotonicity  result  which  was  needed  in  the  proof 
of  Theorem  7.1. 

Lemma 7.2. Under (R1)-(R5) the mapping $\eta \to J_c(f^{\eta})$ is strictly monotone increasing on [0,1].

Proof: It is plain from (5.8) that proving the strict monotonicity of $\eta \to J_c(f^{\eta})$ is equivalent to proving the same for $\eta \to C(\eta, 0)$. Fix $\eta$ in [0,1] and recall the definition (3.3) of the policy $f^{\eta}$.
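The representation (5.22) of the derivative $C'_m(\eta, 0)$ of $C_m(\eta, 0)$ can be written in the form

$$\begin{aligned}
C'_m(\eta, 0)
&= \sum_{t=0}^{m-1} \Bigl\{ E_0^{1_t}\Bigl[\sum_{\ell=1}^{\infty} \mathbf{1}[T = \ell] \sum_{s=0}^{\ell-1} \mathbf{1}[T \le m]\, X_K(s)\Bigr] - E_0^{0_t}\Bigl[\sum_{\ell=1}^{\infty} \mathbf{1}[T = \ell] \sum_{s=0}^{\ell-1} \mathbf{1}[T \le m]\, X_K(s)\Bigr] \Bigr\} \\
&= \sum_{t=0}^{m-1} \sum_{\ell=t+1}^{m} \Bigl\{ E_0^{1_t}\Bigl[\mathbf{1}[T = \ell] \sum_{s=0}^{\ell-1} X_K(s)\Bigr] - E_0^{0_t}\Bigl[\mathbf{1}[T = \ell] \sum_{s=0}^{\ell-1} X_K(s)\Bigr] \Bigr\}
\end{aligned} \tag{7.8}$$

where (5.23) was used. If it were possible to show bounds of the form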

$$\Delta(\ell, t, s) := E_0^{1_t}\bigl[\mathbf{1}[T = \ell]\, X_K(s)\bigr] - E_0^{0_t}\bigl[\mathbf{1}[T = \ell]\, X_K(s)\bigr] \ge \varepsilon(\ell, t, s) \tag{7.9}$$

with $\varepsilon(\ell, t, s) \ge 0$ for all $0 \le s < \ell$ and $0 \le t < \ell$, and $\varepsilon(\ell, t, s) > 0$ for at least one such triple $(\ell, t, s)$, then necessarily, for some m, the derivatives would satisfy $0 < C'_m(\eta, 0) \le C'_{m+1}(\eta, 0) \le \ldots$ and the strict monotonicity would follow from the second equality in (5.28).

Fix t and $\ell$ such that $0 \le t < \ell$. It is easy to see that $\Delta(\ell, t, s) = 0$ whenever $0 \le s \le t < \ell$, so that only the case $0 \le t < s$ has to be considered in order to prove (7.9). This is done by the following coupling arguments.

Let P be a probability measure on $(\Omega, \mathcal{F})$ under which (P1)-(P4) hold and X(0) = 0. Moreover, let $\{\beta(n),\ n = 0, 1, \ldots\}$ be a sequence of i.i.d. Bernoulli RVs with parameter $\eta$ which is also independent of the RVs $\{A(n), B(n),\ n = 0, 1, \ldots\}$ under P.

The key point of the proof is to construct on $\Omega$ a pair of processes $\{X^0(n),\ n = 0, 1, \ldots\}$ and $\{X^1(n),\ n = 0, 1, \ldots\}$ such that (i) $\{X^0(n),\ n = 0, 1, \ldots\}$ (resp. $\{X^1(n),\ n = 0, 1, \ldots\}$) under P is statistically indistinguishable from $\{X(n),\ n = 0, 1, \ldots\}$ under $P_0^{0_t}$ (resp. $P_0^{1_t}$), and (ii) a simple comparison leads to (7.9). To that end, for each i = 0, 1, define the process $\{X^i(n),\ n = 0, 1, \ldots\}$ by the recursion

$$X^i_k(n+1) = X^i_k(n) + A_k(n) - \mathbf{1}[X^i_k(n) \ne 0]\, U^i_k(n)\, B^i_k(n), \quad n = 0, 1, \ldots \tag{7.10}$$

with $X^i(0) = 0$, where the sequences $\{U^i(n),\ n = 0, 1, \ldots\}$ and $\{B^i(n),\ n = 0, 1, \ldots\}$ still need to be specified.

For i = 0, 1, the control actions $\{U^i(n),\ n = 0, 1, \ldots\}$ are defined by

$$U^i(n) = \beta(n)\, \overline{g}(X^i(n)) + (1 - \beta(n))\, \underline{g}(X^i(n)), \quad n \ne t \tag{7.11a}$$

$$U^0(t) = \underline{g}(X^0(t)) \tag{7.11b}$$

$$U^1(t) = \overline{g}(X^1(t)) \tag{7.11c}$$

so that the RVs $\{X^0(n),\ n = 0, 1, \ldots\}$ (resp. $\{X^1(n),\ n = 0, 1, \ldots\}$) are governed by the policy $0_t$ (resp. $1_t$).

Only the service sequences $\{B^i(n),\ n = 0, 1, \ldots\}$, i = 0, 1, need to be specified. First, set $B^0(n) = B(n)$ for all n = 0, 1, ... and observe from the construction (7.10)-(7.11) that the distribution of $\{X^0(n),\ n = 0, 1, \ldots\}$ under P obviously coincides with the distribution of $\{X(n),\ n = 0, 1, \ldots\}$ under $P_0^{0_t}$. The construction of the process $\{B^1(n),\ n = 0, 1, \ldots\}$ is somewhat more involved, and is done below. In order to facilitate the coupling argument, the actual service duration of each customer will be defined so as to have identical length (for each $\omega$ in $\Omega$) in both processes. Set

$$B^1(n) := B(n), \quad n = 0, 1, \ldots, t - 1 \tag{7.12}$$

and observe from (7.10) that in order to determine the process $\{X^1(n),\ n = 0, 1, \ldots\}$, it suffices to provide the values of $B^1_k(n)$ at times n such that $U^1(n) = e_k$, $1 \le k \le K$. For all i = 0, 1 and $1 \le k \le K$, set

$$\tau^i_k(1) := \min\{n \ge t : U^i(n) = e_k\} \tag{7.13a}$$

$$\tau^i_k(\ell) := \min\{n > \tau^i_k(\ell - 1) : U^i(n) = e_k\}, \quad \ell = 2, 3, \ldots \tag{7.13b}$$

and define

$$B^1_k(\tau^1_k(\ell)) := B_k(\tau^0_k(\ell)), \quad 1 \le k \le K, \quad \ell = 1, 2, \ldots \tag{7.14}$$

With these definitions, the actual number of times each customer is served is identical in both systems, while the sequences $\{B(n),\ n = 0, 1, \ldots\}$ (under $P_0^{1_t}$) and $\{B^1(n),\ n = 0, 1, \ldots\}$ (under P) are statistically indistinguishable. Consequently, the distribution of $\{X^1(n),\ n = 0, 1, \ldots\}$ under P coincides with the distribution of $\{X(n),\ n = 0, 1, \ldots\}$ under $P_0^{1_t}$. Moreover, by construction (with the notation of (5.3)), it is easy to see that $T^0 = T^1$ and $X^0_K(n) \le X^1_K(n)$ for all n = 0, 1, ... P-a.s., whence

$$\Delta(\ell, t, s) = E\bigl[\mathbf{1}[T^1 = \ell]\, \bigl(X^1_K(s) - X^0_K(s)\bigr)\bigr] \ge 0. \tag{7.15}$$

Finally, for s = t + 1, observe that on the event

$$A := [T^0 = \ell] \cap [X^0_K(t) \ne 0] \cap [X^1_k(t) \ne 0 \text{ for some } k = 1, 2, \ldots, K - 1] \cap [B_K(t) = 1], \tag{7.16}$$

the equality $X^1_K(t+1) - X^0_K(t+1) = 1$ holds, and that P[A] > 0. Consequently,

$$E\bigl[\mathbf{1}[T^1 = \ell]\, \bigl(X^1_K(t+1) - X^0_K(t+1)\bigr)\bigr] =: \varepsilon(\ell, t, t+1) \ge P[A] > 0 \tag{7.17}$$

and the result is established.
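
The pathwise comparison behind (7.15) can be illustrated by simulation for K = 2. The sketch below is a simplified variant of the construction (7.10)-(7.14), with several assumptions: arrivals and service completions are Bernoulli, service outcomes from time t on are indexed by the attempt count at each queue (so both systems give each customer the same service requirement), and attempts are only counted when the served queue is non-empty. The function name `coupled_paths` and all parameter values are hypothetical.

```python
import random

def coupled_paths(t, horizon, eta, seed, p_arr=(0.2, 0.2), p_srv=(0.7, 0.7)):
    """Return the coupled paths (system 0, system 1) as lists of (x1, x2)."""
    rng = random.Random(seed)
    # Randomness shared by both systems.
    arrivals = [[int(rng.random() < p_arr[k]) for k in range(2)] for _ in range(horizon)]
    beta = [int(rng.random() < eta) for _ in range(horizon)]
    # Before t: service outcomes indexed by time. From t on: by attempt number.
    early = [[int(rng.random() < p_srv[k]) for k in range(2)] for _ in range(horizon)]
    later = [[int(rng.random() < p_srv[k]) for _ in range(horizon)] for k in range(2)]

    def run(system):
        x, attempts, path = [0, 0], [0, 0], [(0, 0)]
        for n in range(horizon):
            # At time t, system 1 favours queue 1 and system 0 favours queue 2;
            # at every other time both follow the same randomization beta(n).
            favour_queue_1 = (system == 1) if n == t else (beta[n] == 1)
            order = (0, 1) if favour_queue_1 else (1, 0)
            served = next((k for k in order if x[k] > 0), None)
            if served is not None:
                if n < t:
                    done = early[n][served]
                else:
                    done = later[served][attempts[served]]
                    attempts[served] += 1
                x[served] -= done
            for k in range(2):
                x[k] += arrivals[n][k]
            path.append((x[0], x[1]))
        return path

    return run(0), run(1)

path0, path1 = coupled_paths(t=5, horizon=300, eta=0.5, seed=1)
dominated = all(p0[1] <= p1[1] for p0, p1 in zip(path0, path1))
```

In this sketch `dominated` records whether the length of queue 2 in system 0 never exceeds its length in system 1 along the sampled path, which is the ordering $X^0_K(n) \le X^1_K(n)$ used above. The two paths agree up to time t because the systems are driven identically before the forced decision.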